1 Functions TOC

A list of the functions created in this document.

ToDo:

Add: calculate_attrition_cost, count_to_pct, plot_attrition in scripts - assess_attrition

Artifact Thumbnail
Visualizations
plot_attrition
plot_ggpairs
plot_hist
plot_cor
plot_leaderboard
Precision_Recall_plot
ROC_Plot
H2O_Performance_Plot
Plot_Lime_Feature_Explain
Plot_Lime_Features_Explanations
Confusion_Matrix_Plot
Correlation_Analysis
Functions
get_cor
extract_h2o_model_name_by_position
Model_Metrics
calculate_savings_by_thresh_2
Code Snippets
Tidy_Eval Tidy_Eval quos, enquo and !!!
rlang rlang::sym turns a single character string into an expression. The expression is typically captured in enquo() or quos() to delay evaluation
anonymous If you choose not to give the function a name, you get an anonymous function.
fct_recode mutate(OverTime = fct_recode(OverTime, "No" = "Yes"))
case_when mutate(above_industry_avg = case_when(pct > 0.088 ~ "Yes", TRUE ~ "No"))
gather_example Gather takes multiple columns and collapses into key-value pairs
fct_reorder2 fct_reorder() is useful for 1d displays where the factor is mapped to position; fct_reorder2() for 2d displays where the factor is mapped to a non-position aesthetic
add_column This is a convenient way to add one or more columns to an existing data frame.
partial partial allows you to preload a function with arguments that do not change. Part of purrr
firstMap secondMap mutate and map are a powerful combination enabling row-wise iteration in a tidy way:
mutate(metrics = map(model_id, get_model_performance_metrics, newdata_tbl))
expand_limits Change the range of axis values in ggplot: expand_limits(x = c(-1, 1))
scale_x_continuous scale_x_continuous(label = scales::dollar)
cross_df cross_df from purrr is really helpful. From a list of variables, it produces all combinations of the list.
scale_fill_gradient2 scale_fill_gradient2 uses 2 gradient transitions by specifying low, mid and high arguments along with a midpoint to specify the middle transition point. There are variations like scale_fill_gradient, which offers only a low and high point. scale_fill_gradientn allows any number of gradient colors.
str_detect filter(str_detect(terms, "YearsAtCompany"))

2 Employee Turnover: A $15M Problem

Differentiating between productive and non-productive employees is difficult and highly subjective. Some employees are in the top 20% (all stars), some are in the middle 60% (productive), and some are in the bottom 20%. The bottom 20% can be acceptable or poor depending on the overall level of talent in the organization: in a high-performing organization, even the bottom 20% may still be good employees simply because everyone the organization employs is good. However, most organizations do have bad employees.

Now, in terms of defining productivity: suppose managers rate employees on a 1-to-4 scale, with 1 being poor and 4 the best. You can usually treat the 4's as productive. They may not all be all stars, especially if 50% of the people reviewed get 4's when, by the bell curve, only about 20% should. However, if a manager rates someone a 1, there is definitely a problem. These people either need immediate corrective action or need to be cut loose. This is good attrition because the employee is a poor job/culture fit.

The analysis we are doing specifically looks at bad attrition: losing people who are 3's and 4's. I call them high performers, but in reality I should say productive employees. This group is probably driving 80% of the business results.

Non-profits will need to be handled based on what the organization finds important. While maximizing profit may not be the goal, there are still goals. For example, one goal might be to increase high school graduation rates. This could be set up by showing that the percentage of student dropouts is much higher in the district of focus, and that increasing graduation rates is expected to reduce crime and increase workforce productivity. Crime reduction has a savings associated with it; workforce productivity has a financial benefit associated with it. These are possible ways to show financial value for a non-profit.

Non-Profit Cost Analysis: https://www.bridgespan.org/bridgespan/Images/articles/nonprofit-cost-analysis-toolkit/NonprofitCostsAnalysisToolkit.pdf

Non-Profit Strategic Planning: https://knowhownonprofit.org/organisation/strategy

The tools are quite similar to what you might find in a for-profit scenario. The objective is to maximize the financial delta between cost and benefit (to the public) versus cost and revenue (for the organization).

2.1 What Happens When Good Employees Leave

You take away our top 20 employees and we become a mediocre company. –Bill Gates

  • All employees turn over; some just faster
  • Impact of Attrition
    • Varies based on employee productivity
    • Can be measured
  • Bad turnover costs the organization

Employee Retention Now a Big Issue: Why the Tide has Turned

When a good employee quits, costs are incurred.

  • Quits
  • Vacancy during search
  • Cost to train new employee

Not all attrition is bad

2.2 Calculating the Cost of Turnover

  • Direct Costs
    • Average Separation (Exit Interviews, Separation Pay, Admin Costs) ($500)
    • Average Vacancy (Temp Help & OT) ($10,000)
    • Average Acquisition (Ads, Travel, Interviews, Physicals, Drug Testing, Bonuses, Reference Checks, etc) ($4,900)
    • Average Placement (New Supplies, Onboarding, Training) ($3,500)

\[500+10,000+4,900+3,500=18,900\]

  • Productivity Costs (Hidden Costs - only recognized when the problem becomes large. Much higher than direct costs.)
    • Annual revenue per Employee - less COGS (250,000)
    • Workdays per Year (240)
    • Average Workdays Position Open (40)
    • Average Onboarding/Training Period (60 ~ 12 weeks)
    • Effectiveness During Onboarding/Training (50%)

\[(250,000/240)*(40+(60*.5))=72,917\]

  • Salary & Benefits (a cost reduction: salary and benefits are not paid while the position is open)
    • Average Salary + Benefits (80,000)
    • Workdays per Year (240)
    • Average Workdays Position Open (40)

\[(80,000/240)*40=13,333\] \[Total Cost = 18,900+72,917-13,333=78,483\]

If 200 productive employees leave each year, then \(200 \times 78,483 \approx 15,700,000\) is the annual financial cost of turnover.

Just a 10% reduction (using data science) saves the organization $1.6M/year!
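The arithmetic above can be wrapped in a small R function. This is a sketch; the function name and default values mirror the worked example above (the same calculate_attrition_cost reappears later in these notes):

```r
# Sketch of a turnover-cost calculator; defaults mirror the figures above.
calculate_attrition_cost <- function(
     n                        = 1,      # number of productive employees leaving
     salary                   = 80000,
     separation_cost          = 500,
     vacancy_cost             = 10000,
     acquisition_cost         = 4900,
     placement_cost           = 3500,
     net_revenue_per_employee = 250000,
     workdays_per_year        = 240,
     workdays_position_open   = 40,
     workdays_onboarding      = 60,
     onboarding_effectiveness = 0.50
) {
     # Direct costs: separation + vacancy + acquisition + placement
     direct_cost <- separation_cost + vacancy_cost + acquisition_cost + placement_cost

     # Hidden productivity costs: revenue lost while the seat is empty, plus the
     # share of onboarding days lost to reduced effectiveness
     productivity_cost <- net_revenue_per_employee / workdays_per_year *
          (workdays_position_open + workdays_onboarding * (1 - onboarding_effectiveness))

     # Salary & benefits not paid while the position is open (a cost reduction)
     salary_benefit_reduction <- salary / workdays_per_year * workdays_position_open

     n * (direct_cost + productivity_cost - salary_benefit_reduction)
}

calculate_attrition_cost()        # 78483.33 for one employee
calculate_attrition_cost(n = 200) # 15696666.67 per year
```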

3 Toolbox Supporting Analysis

3.1 Integrated Data Science Frameworks: BSPF & CRISP-DM

Business Science Problem Framework

  • Frameworks provide
    • Consistency across projects
    • A critical-step checklist
    • Clarity to stakeholders
    • A way to keep the project on track

BSPF is aligned with industry standard CRISP-DM

4 Frameworks

4.1 CRISP-DM

CRISP-DM is a high-level, industry-standard framework for data mining.

It is a high-level plan that is adaptable but lacks the detail to drive data science projects.

CRISP-DM Process Cycle

  1. Business Understanding
    • Critical - sets stage for all downstream events
    • Problem solving and project scope development with the business
    • Combines discovery and project management
    • Documents current state and the business success metric
  2. Data Understanding
    • 20% of project time typically consumed by data collection
  3. Data Preparation
    • 60% of your time cleaning and organizing data
  4. Modeling
    • information turned into insight
    • Model Understanding
    • Model Assessment
  5. Evaluation
    • Assess the degree to which the data insights meet business objectives (ROI)
  6. Deployment
    • REST API
    • Shiny App
    • Markdown file output

4.2 Business Science Problem Framework

The goal is to make good decisions. Systematic decision making starts with measurement and analysis, which leads to improvement.

The BSPF is based on Principles by Ray Dalio

BSPF has 7 steps:

  1. Business as a Machine

Each part of the business has Objectives -> Machine -> Outcomes.

  • Objectives are the goals that serve as inputs or instructions to the machine
  • Machine - performs the work and is made up of people and processes
    • If the people or processes are not right, the machine produces questionable output
    • People have biases
    • Process issues abound in all organizations
    • People/process design fit may be poor
  • Outcomes - what the machine produces
    • Provide feedback to determine if the objectives are being met

Our goal is to understand the machine using data

When you have exhausted all possibilities, remember this - you haven’t.

  • Thomas Edison

5 Business Understanding: BSPF & Code Workflows

5.1 Problem Understanding With BSPF

5.1.1 Business Understanding

Within the Business Understanding phase of the CRISP-DM framework, the BSPF includes the following steps:

  1. View the Business as a Machine
  2. Understand the Drivers
  3. Measure the Drivers
  4. Uncover Problems & Opportunities

5.1.2 View The Business As A Machine

Review the business_understanding.R file in the 01_Business Understanding folder

  • Business Units: Department and Job Role
  • Define Objectives: Retain high performers
  • Assess Outcomes: TBD - Feedback from the business

Known - 16.1% attrition in the current state - is this a bad thing?

5.2 Understand The Drivers: By Dept

  • Investigate the Objectives: 16% Attrition
  • Synthesize Outcomes: High counts and high percentages (learned after hypothesizing drivers). See below under Understand the Drivers.
  • Hypothesize Drivers: Job Role and Department

## # A tibble: 6 x 4
## # Groups:   Department [3]
##   Department             Attrition     n   pct
##   <chr>                  <chr>     <int> <dbl>
## 1 Human Resources        No           37 0.755
## 2 Human Resources        Yes          12 0.245
## 3 Research & Development No          721 0.867
## 4 Research & Development Yes         111 0.133
## 5 Sales                  No          291 0.789
## 6 Sales                  Yes          78 0.211

Department and Job Roles are common cohorts to evaluate.

HR has 24.5% attrition and Sales 21.1% - something may be going on at the department level.
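A sketch of the dplyr pipeline that produces a table like the one above. The tibble employees_tbl here is an illustrative stand-in, built to match the Human Resources and Sales rows; the real analysis runs on the full HR dataset:

```r
library(dplyr)

# Toy stand-in for the HR dataset, matching the HR and Sales counts above
employees_tbl <- tibble(
     Department = c(rep("Human Resources", 49), rep("Sales", 369)),
     Attrition  = c(rep("No", 37), rep("Yes", 12), rep("No", 291), rep("Yes", 78))
)

employees_tbl %>%
     group_by(Department, Attrition) %>%
     summarize(n = n()) %>%    # drops the last grouping level (Attrition)
     mutate(pct = n / sum(n))  # sum(n) is therefore computed per Department
```

Because summarize() peels off only the innermost group, the result is still grouped by Department, which is why pct is a within-department percentage.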

5.3 Understand The Drivers: By Job Role

## # A tibble: 10 x 5
##    Department             JobRole                   Attrition     n    pct
##    <chr>                  <chr>                     <chr>     <int>  <dbl>
##  1 Human Resources        Human Resources           Yes          12 0.308 
##  2 Research & Development Healthcare Representative Yes           8 0.0762
##  3 Research & Development Laboratory Technician     Yes          49 0.219 
##  4 Research & Development Manager                   Yes           2 0.0417
##  5 Research & Development Manufacturing Director    Yes           7 0.0569
##  6 Research & Development Research Director         Yes           2 0.0274
##  7 Research & Development Research Scientist        Yes          43 0.166 
##  8 Sales                  Manager                   Yes           2 0.0645
##  9 Sales                  Sales Executive           Yes          50 0.183 
## 10 Sales                  Sales Representative      Yes          26 0.4

Note there are different roles within each department. Perhaps certain roles have greater attrition issues than others?
Note HR at 30.8%. Note Sales Executive: only 18.3%, but 50 people are leaving. Sales Representative (40%) may be a problem too.

5.4 Measure The Drivers: Collect Data

Now in the BSPF - Measure the Drivers Step

This is an iterative task that is ongoing.

You may not have all the data initially. Further, you don’t want all the data, since many features are a distraction before you know whether a problem exists. During the Business Understanding phase we need to isolate the critical features that relate to business outcomes, and determine whether a problem exists that, when solved, provides real value in terms of financial benefit.

There are 35 features to consider:

##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EmployeeCount"            "EmployeeNumber"          
## [11] "EnvironmentSatisfaction"  "Gender"                  
## [13] "HourlyRate"               "JobInvolvement"          
## [15] "JobLevel"                 "JobRole"                 
## [17] "JobSatisfaction"          "MaritalStatus"           
## [19] "MonthlyIncome"            "MonthlyRate"             
## [21] "NumCompaniesWorked"       "Over18"                  
## [23] "OverTime"                 "PercentSalaryHike"       
## [25] "PerformanceRating"        "RelationshipSatisfaction"
## [27] "StandardHours"            "StockOptionLevel"        
## [29] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [31] "WorkLifeBalance"          "YearsAtCompany"          
## [33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [35] "YearsWithCurrManager"

5.4.1 Group Features

Some features relate to one another.

It is always helpful to break down data collection activities into strategic areas.

  • Descriptive Features:
    • Age, DistanceFromHome, Gender, MaritalStatus, NumCompaniesWorked, Over18
  • Employment Features:
    • Department, EmployeeCount, EmployeeNumber, JobInvolvement, JobLevel, JobRole, JobSatisfaction
  • Compensation Features:
    • DailyRate, HourlyRate, MonthlyIncome, MonthlyRate, PercentSalaryHike, StockOptionLevel
  • Survey Results:
    • EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction, WorkLifeBalance
  • Performance Data:
    • JobInvolvement, PerformanceRating
  • Work-Life Balance:
    • BusinessTravel, OverTime
  • Training and Education:
    • Education, EducationField, TrainingTimesLastYear
  • Time-Based Features:
    • TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager

5.5 Measure The Drivers: Develop KPIs

Routinely review turnover statistics in your industry. We know ours is 16%, but we do not know whether that is good or bad without [research](http://www.air-prehire.com/blog/industries-high-employee-turnover-rates/).
Perhaps it simply fails an internal KPI.

Via web searching, the average turnover for utility companies is 8.8% (useful for this data); banking is around 17.2%. {#case_when}

## # A tibble: 10 x 6
##    Department      JobRole         Attrition     n    pct above_industry_~
##    <chr>           <chr>           <chr>     <int>  <dbl> <chr>           
##  1 Sales           Sales Represen~ Yes          26 0.4    Yes             
##  2 Human Resources Human Resources Yes          12 0.308  Yes             
##  3 Research & Dev~ Laboratory Tec~ Yes          49 0.219  Yes             
##  4 Sales           Sales Executive Yes          50 0.183  Yes             
##  5 Research & Dev~ Research Scien~ Yes          43 0.166  Yes             
##  6 Research & Dev~ Healthcare Rep~ Yes           8 0.0762 No              
##  7 Sales           Manager         Yes           2 0.0645 No              
##  8 Research & Dev~ Manufacturing ~ Yes           7 0.0569 No              
##  9 Research & Dev~ Manager         Yes           2 0.0417 No              
## 10 Research & Dev~ Research Direc~ Yes           2 0.0274 No

Note the Dept and Job Role with high percentages and counts. To make this more impactful, add cost to this table.
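The above_industry_avg flag in the table can be created with case_when against the 8.8% baseline. A sketch, using an illustrative two-row slice of the attrition-by-role table (the tibble name is a placeholder):

```r
library(dplyr)

# Illustrative slice of the attrition-by-role table
attrition_by_role_tbl <- tibble(
     JobRole = c("Sales Representative", "Research Director"),
     pct     = c(0.4, 0.0274)
)

attrition_by_role_tbl %>%
     arrange(desc(pct)) %>%
     mutate(above_industry_avg = case_when(
          pct > 0.088 ~ "Yes",   # above the 8.8% utility-industry average
          TRUE        ~ "No"
     ))
```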

5.6 Uncover Problems & Opportunities

5.6.1 calculate_attrition_cost()

Create a function that mimics the Excel calculator.

The function’s default values match those introduced in Calculating the Cost of Turnover. Called with its defaults, calculate_attrition_cost returns 78483.33. Changing n = 1 (representing 1 employee) to n = 200 returns 15696666.67.

5.6.2 Calculating Cost By Job Level

Using code we have already seen above, modified slightly to focus on JobRole rather than Department, we can begin this part of the journey by reviewing JobRole data. Note that we reuse the custom function calculate_attrition_cost to produce the cost_of_attrition column below.

## # A tibble: 9 x 6
##   JobRole        Attrition     n    pct above_industry_a~ cost_of_attriti~
##   <chr>          <chr>     <int>  <dbl> <chr>                        <dbl>
## 1 Sales Represe~ Yes          26 0.4    Yes                       2040567.
## 2 Human Resourc~ Yes          12 0.308  Yes                        941800.
## 3 Laboratory Te~ Yes          49 0.219  Yes                       3845683.
## 4 Sales Executi~ Yes          50 0.183  Yes                       3924167.
## 5 Research Scie~ Yes          43 0.166  Yes                       3374783.
## 6 Healthcare Re~ Yes           8 0.0762 No                         627867.
## 7 Manufacturing~ Yes           7 0.0569 No                         549383.
## 8 Manager        Yes           4 0.0449 No                         313933.
## 9 Research Dire~ Yes           2 0.0274 No                         156967.

Note the costs associated with Sales Executive and Laboratory Technician.

5.7 Streamlining The Attrition Code Workflow

This section is dedicated to streamlining the code developed above.

5.7.1 Streamlining The Counts

This code pattern is commonly used:

## # A tibble: 18 x 3
## # Groups:   JobRole [?]
##    JobRole                   Attrition     n
##    <chr>                     <chr>     <int>
##  1 Healthcare Representative No           97
##  2 Healthcare Representative Yes           8
##  3 Human Resources           No           27
##  4 Human Resources           Yes          12
##  5 Laboratory Technician     No          175
##  6 Laboratory Technician     Yes          49
##  7 Manager                   No           85
##  8 Manager                   Yes           4
##  9 Manufacturing Director    No          116
## 10 Manufacturing Director    Yes           7
## 11 Research Director         No           71
## 12 Research Director         Yes           2
## 13 Research Scientist        No          216
## 14 Research Scientist        Yes          43
## 15 Sales Executive           No          223
## 16 Sales Executive           Yes          50
## 17 Sales Representative      No           39
## 18 Sales Representative      Yes          26

The code above is replaced with this simpler code:

## # A tibble: 18 x 3
##    JobRole                   Attrition     n
##    <chr>                     <chr>     <int>
##  1 Healthcare Representative No           97
##  2 Healthcare Representative Yes           8
##  3 Human Resources           No           27
##  4 Human Resources           Yes          12
##  5 Laboratory Technician     No          175
##  6 Laboratory Technician     Yes          49
##  7 Manager                   No           85
##  8 Manager                   Yes           4
##  9 Manufacturing Director    No          116
## 10 Manufacturing Director    Yes           7
## 11 Research Director         No           71
## 12 Research Director         Yes           2
## 13 Research Scientist        No          216
## 14 Research Scientist        Yes          43
## 15 Sales Executive           No          223
## 16 Sales Executive           Yes          50
## 17 Sales Representative      No           39
## 18 Sales Representative      Yes          26

count is part of dplyr!

Do not forget about the dplyr::tally function - very useful

tally() is a convenient wrapper for summarize that will either call n() or sum(n) depending on whether you’re tallying for the first time, or re-tallying. count() is similar but calls group_by() before and ungroup() after.
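For example, the verbose grouped count collapses to a single count() call (toy data here; the real table uses the HR dataset):

```r
library(dplyr)

toy_tbl <- tibble(
     JobRole   = c("Manager", "Manager", "Manager", "Sales Executive"),
     Attrition = c("No", "No", "Yes", "Yes")
)

# Verbose version
toy_tbl %>%
     group_by(JobRole, Attrition) %>%
     summarize(n = n()) %>%
     ungroup()

# Streamlined equivalent: count() does the group_by / tally / ungroup for you
toy_tbl %>%
     count(JobRole, Attrition)
```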

5.7.2 Streamlining The Count To Percentage Calculation

Pay attention - this is really good code using tidy evaluation!

We have this code so far in streamlining the attrition code:

## # A tibble: 18 x 4
## # Groups:   JobRole [9]
##    JobRole                   Attrition     n    pct
##    <chr>                     <chr>     <int>  <dbl>
##  1 Healthcare Representative No           97 0.924 
##  2 Healthcare Representative Yes           8 0.0762
##  3 Human Resources           No           27 0.692 
##  4 Human Resources           Yes          12 0.308 
##  5 Laboratory Technician     No          175 0.781 
##  6 Laboratory Technician     Yes          49 0.219 
##  7 Manager                   No           85 0.955 
##  8 Manager                   Yes           4 0.0449
##  9 Manufacturing Director    No          116 0.943 
## 10 Manufacturing Director    Yes           7 0.0569
## 11 Research Director         No           71 0.973 
## 12 Research Director         Yes           2 0.0274
## 13 Research Scientist        No          216 0.834 
## 14 Research Scientist        Yes          43 0.166 
## 15 Sales Executive           No          223 0.817 
## 16 Sales Executive           Yes          50 0.183 
## 17 Sales Representative      No           39 0.6   
## 18 Sales Representative      Yes          26 0.4

Calculating percentages from counts is a common programming requirement. Why not create a function that generalizes it?

  • Tidy functions always require the data to take the first position of the function {#Tidy_Eval}
  • ... enables passing multiple, un-named arguments to a function
    • Because the dots are not preselected, the user can flexibly add variables and the function will adapt!
  • The 3rd function argument is a column name - in this case n, since we are calculating a percentage from the counts in the n column
  1. ... must be enclosed in quos() for multiple columns or enquo() for a single column, respectively. This saves the column names as unevaluated expressions that can be evaluated later in the processing. Use quos() for multiple columns and enquo() for the single column we have called n
  2. !!! (bang-bang-bang) splices multiple grouping variables into a call and evaluates them. (Use !! for a single column.)
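Putting those pieces together, a sketch of the count_to_pct() function this section describes (the body follows the quos/enquo/!!! recipe above):

```r
library(dplyr)

count_to_pct <- function(data, ..., col = n) {
     # Capture the grouping columns (plural) and the count column (singular)
     # as unevaluated expressions
     grouping_vars_expr <- quos(...)
     col_expr           <- enquo(col)

     data %>%
          group_by(!!! grouping_vars_expr) %>%                # !!! splices the groups in
          mutate(pct = (!! col_expr) / sum(!! col_expr)) %>%  # !! evaluates the single column
          ungroup()
}

# Usage: pct is computed within each JobRole
tibble(JobRole = c("A", "A", "B", "B"), n = c(3, 1, 2, 2)) %>%
     count_to_pct(JobRole)
```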

Demo:

## # A tibble: 21 x 5
##    Department             JobRole                   Attrition     n    pct
##    <chr>                  <chr>                     <chr>     <int>  <dbl>
##  1 Human Resources        Human Resources           No           27 0.692 
##  2 Human Resources        Human Resources           Yes          12 0.308 
##  3 Human Resources        Manager                   No           10 1     
##  4 Research & Development Healthcare Representative No           97 0.924 
##  5 Research & Development Healthcare Representative Yes           8 0.0762
##  6 Research & Development Laboratory Technician     No          175 0.781 
##  7 Research & Development Laboratory Technician     Yes          49 0.219 
##  8 Research & Development Manager                   No           46 0.958 
##  9 Research & Development Manager                   Yes           2 0.0417
## 10 Research & Development Manufacturing Director    No          116 0.943 
## # ... with 11 more rows

5.7.3 Streamlining The Attrition Assessment

Continuing the development of a streamlined function, review where we are so far:

## # A tibble: 10 x 6
##    Department      JobRole         Attrition     n    pct above_industry_~
##    <chr>           <chr>           <chr>     <int>  <dbl> <chr>           
##  1 Sales           Sales Represen~ Yes          26 0.4    Yes             
##  2 Human Resources Human Resources Yes          12 0.308  Yes             
##  3 Research & Dev~ Laboratory Tec~ Yes          49 0.219  Yes             
##  4 Sales           Sales Executive Yes          50 0.183  Yes             
##  5 Research & Dev~ Research Scien~ Yes          43 0.166  Yes             
##  6 Research & Dev~ Healthcare Rep~ Yes           8 0.0762 No              
##  7 Sales           Manager         Yes           2 0.0645 No              
##  8 Research & Dev~ Manufacturing ~ Yes           7 0.0569 No              
##  9 Research & Dev~ Manager         Yes           2 0.0417 No              
## 10 Research & Dev~ Research Direc~ Yes           2 0.0274 No

Generalize filter - arrange - mutate statements:

  • As always, tidy eval starts with the data (a tibble)
  • The next argument is the name of the Attrition column - it may not always be called Attrition
  • The 3rd argument holds the values of the attrition_col - in our case the values are Yes and No
  • baseline_pct is hopefully self-explanatory!
  1. Recall that enquo() saves a single column name as an expression that can be evaluated later. This way the Attrition column can be named whatever the user chooses!
  2. pct does not change because it is created consistently by the count_to_pct function. (Of course this could be changed as an enhancement!)
  3. Simple substitution of baseline_pct for the hard-coded 0.088.

Use parentheses to give tidy eval evaluations priority!

Function inputs in text format (attrition_value = "Yes") do not require tidy eval
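Assembling those pieces, a sketch of the assess_attrition() function described above (the toy tibble in the usage line is illustrative; in practice pct comes from a prior count_to_pct() step):

```r
library(dplyr)

assess_attrition <- function(data, attrition_col, attrition_value, baseline_pct) {
     # enquo() captures the column name, so it need not literally be "Attrition"
     attrition_col_expr <- enquo(attrition_col)

     data %>%
          # Parentheses give the tidy eval expression priority over %in%
          filter((!! attrition_col_expr) %in% attrition_value) %>%
          arrange(desc(pct)) %>%
          mutate(above_industry_avg = case_when(
               pct > baseline_pct ~ "Yes",  # baseline_pct replaces the hard-coded 0.088
               TRUE               ~ "No"
          ))
}

# Usage
tibble(Attrition = c("Yes", "No", "Yes"), pct = c(0.05, 0.2, 0.3)) %>%
     assess_attrition(Attrition, attrition_value = "Yes", baseline_pct = 0.088)
```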

Demo the new code:

## # A tibble: 10 x 7
##    Department JobRole Attrition     n    pct above_industry_~
##    <chr>      <chr>   <chr>     <int>  <dbl> <chr>           
##  1 Sales      Sales ~ Yes          26 0.4    Yes             
##  2 Human Res~ Human ~ Yes          12 0.308  Yes             
##  3 Research ~ Labora~ Yes          49 0.219  Yes             
##  4 Sales      Sales ~ Yes          50 0.183  Yes             
##  5 Research ~ Resear~ Yes          43 0.166  Yes             
##  6 Research ~ Health~ Yes           8 0.0762 No              
##  7 Sales      Manager Yes           2 0.0645 No              
##  8 Research ~ Manufa~ Yes           7 0.0569 No              
##  9 Research ~ Manager Yes           2 0.0417 No              
## 10 Research ~ Resear~ Yes           2 0.0274 No              
## # ... with 1 more variable: cost_of_attrition <dbl>

5.8 Visualizing Attrition With ggplot2

Create an impactful visualization like this:

  • The plot above combines Department and JobRole together as labels for the y-axis.
  • The string manipulated in #1 must be converted to a factor and ordered from high to low by cost_of_attrition to make the plot readable.
  • The text labels need formatting to make the plot easy to read and professional!

5.8.1 Data Manipulation For Visualization

Here is the code so far plus some data manipulation so we can produce the plot. {#fct_reorder}

## # A tibble: 10 x 9
##    Department JobRole Attrition     n    pct above_industry_~
##    <chr>      <chr>   <chr>     <int>  <dbl> <chr>           
##  1 Sales      Sales ~ Yes          26 0.4    Yes             
##  2 Human Res~ Human ~ Yes          12 0.308  Yes             
##  3 Research ~ Labora~ Yes          49 0.219  Yes             
##  4 Sales      Sales ~ Yes          50 0.183  Yes             
##  5 Research ~ Resear~ Yes          43 0.166  Yes             
##  6 Research ~ Health~ Yes           8 0.0762 No              
##  7 Sales      Manager Yes           2 0.0645 No              
##  8 Research ~ Manufa~ Yes           7 0.0569 No              
##  9 Research ~ Manager Yes           2 0.0417 No              
## 10 Research ~ Resear~ Yes           2 0.0274 No              
## # ... with 3 more variables: cost_of_attrition <dbl>, name <fct>,
## #   cost_text <chr>
  • LINE 1: str_c from stringr concatenates Department and JobRole with a colon in between.
    • Factors are required in ggplot because they are used to order categorical variables (plain text will not work)
    • as_factor is part of forcats. It creates levels in the order in which they appear, which is the same on every platform. (Base R sorts in the current locale, which can vary from place to place.)
  • LINE 2: fct_reorder, also from forcats, reorders a factor’s levels by the magnitude of another numeric variable - in this case, cost_of_attrition.
  • LINE 3: str_c concatenation again - pretty simple - but effective!
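The three lines described above might look like this. A sketch: the toy tibble and the exact dollar-label formatting are illustrative, not the document's verbatim code:

```r
library(dplyr)
library(stringr)
library(forcats)

# Toy stand-in for the assessed attrition table
assessed_tbl <- tibble(
     Department        = c("Sales", "Human Resources"),
     JobRole           = c("Sales Representative", "Human Resources"),
     cost_of_attrition = c(2040567, 941800)
)

plot_data <- assessed_tbl %>%
     # LINE 1: concatenate Department and JobRole, then convert to a factor
     mutate(name = str_c(Department, JobRole, sep = ": ") %>% as_factor()) %>%
     # LINE 2: reorder the factor levels by the attrition cost
     mutate(name = fct_reorder(name, cost_of_attrition)) %>%
     # LINE 3: readable dollar text labels for the plot
     mutate(cost_text = str_c("$", format(cost_of_attrition / 1e6, digits = 2), "M"))
```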

5.8.2 Visualization With ggplot2

OK, the data manipulation to plot the cost of attrition is complete. Now review the ggplot code that produces the plot shown at the beginning of this section. It is easy - provided you do one step at a time!

  • LINE 1: geom_segment adds lines to the plot canvas. The first aes statement locates the end of the line at cost_of_attrition. xend = 0 defines the start of the line segment, and yend makes sure the segment aligns with the correct y-axis label. Simple!
    • palette_light()[[1]] selects the color #2c3e50 from the tidyquant package.
    • The other colors in palette_light are: #e31a1c, #18BC9C, #CCBE93, #a6cee3, #1f78b4, #b2df8a, #fb9a99, #fdbf6f, #ff7f00, #cab2d6, #6a3d9a
  • LINE 2: size = cost_of_attrition scales the size of the segment endpoint by the attrition cost magnitude (this could have been included in the first aes statement)
  • LINE 3: Format the x-labels with the scales package to get the values in dollar format. Remember the scale_x_continuous ggplot function!
  • LINE 4: geom_label adds labels to points on the plot using cost_text, which was created earlier in the plot data manipulation.
    • Cool - the label size can be dynamically set by the attrition cost.
  • LINE 5: theme_tq from tidyquant is a general theme that adjusts the background, faceting palette, legend position and more. It works well for conservative business publications.
    • scale_size adjusts the max and min size of elements to prevent large/small values from becoming too large/small. With label sizes limited to size = 3, the text for lower-value labels is now readable. Cool!

5.10 Developing plot_attrition() {#plot_attrition} {#tidy_eval2}

First, an explanation of quosures and tidy evaluation:

There are four functions:

  • quo() and enquo(): singular - capture one variable, stored as an unevaluated expression
  • quos() and enquos(): plural - capture multiple variables, stored as a list of unevaluated expressions

Explanation: enquo() is singular and used inside a function. It is comparable to quo(), which is used outside of functions. Both do roughly the same thing: they quote the variable, capturing it as an unevaluated expression before it gets evaluated. The “quoted” variable is stored for evaluation later. The main difference is that quo() should be used outside functions and enquo() inside. The reason why is complex and not important here.

The rlang library now has an enquos() function that is plural (rlang did not always have enquos(), which is why quos() was used). Both capture multiple variables as expressions contained inside a list. This is useful when you don’t know how many arguments the user will supply - for example, several columns. The difference between quos() and enquos() is, again, that quos() is meant to be used outside functions whereas enquos() is meant to be used inside. However, this is not what the tutorial shows, because early versions of rlang did not include the enquos() function.

rlang::sym turns a single character string into an expression. The expression is typically captured in enquo() or quos() to delay evaluation. The sym() and syms() functions are only necessary when converting a string to a symbol. Generally the user will supply a symbol, such as a column name, not a string. Therefore, we only need sym()/syms() in the special case where a string represents one or more column names.

In summary:

  • Use quo() / enquo() when capturing one variable.
  • Use quos() / enquos() when capturing multiple variables.
  • Use quo() / quos() outside of functions.
  • Use enquo() / enquos() inside of functions.
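A minimal demo of these functions plus rlang::sym() (the function and column names here are illustrative):

```r
library(dplyr)
library(rlang)

# enquo(): capture ONE user-supplied column inside a function
summarize_avg <- function(data, col) {
     col_expr <- enquo(col)                       # quote the bare column name
     data %>% summarize(avg = mean(!! col_expr))  # !! unquotes it for evaluation
}

# quos(): capture MANY user-supplied columns inside a function
group_count <- function(data, ...) {
     group_vars_expr <- quos(...)                 # list of quoted columns
     data %>%
          group_by(!!! group_vars_expr) %>%       # !!! splices the whole list
          summarize(n = n())
}

tbl <- tibble(g = c("a", "a", "b"), x = c(1, 3, 5))

tbl %>% summarize_avg(x)                   # column supplied as a bare symbol
tbl %>% summarize_avg(!! rlang::sym("x"))  # string converted to a symbol first
tbl %>% group_count(g)
```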

plot_attrition <- function(data, ..., .value, # ... capture the departments - mutate(name = str_c(Department, JobRole....; .value - cost_of_attrition
                           fct_reorder = TRUE, # order the function by the .value argument
                           fct_rev = FALSE, # ascending vs descending y lables
                           include_lbl = TRUE,
                           color = palette_light()[[1]],
                           units = c("0", "K", "M")) {
                           #sometimes the untis will not be in millions, default is 0 - no units
# Inputs

     group_vars_expr <- quos(...) #perhaps department and JobRoles
     if(length(group_vars_expr) == 0) # Then just use the first column name - these drive the y labels in the plot so there has to be something!
          group_vars_expr <- quos(!! rlang::sym(colnames(data)[[1]])) # !! unquotes so the quosure holds the column symbol, not the sym() call
          
     value_expr <- enquo(.value) # for data manipulation (it's cost_of_attrition - used in data manipulation and plotting)
     value_name <- quo_name(value_expr) # for ggplot - makes a character string in quotes (opposite of rlang::sym()); ggplot is not ready for tidy eval
     
     units_val <- switch(units[[1]], "M" = 1e6, "K" = 1e3, "0" = 1) # works like nested if statements
     if (units[[1]] == "0") units <- "" #Make it empty because it would get appended to labels in the plot
     
# Data Manipulation
     
     # Function factory - a function that produces a function; scales::dollar_format() is the factory, usd is the function it returns - try usd(1000)
     usd <- scales::dollar_format(prefix = "$", largest_with_cents = 1e3)

     data_manipulated <- data %>% 
          mutate(name = str_c(!!! group_vars_expr, sep = ": ") %>% as_factor()) %>% 
          # Above: () removed around !!! group_vars_expr, otherwise str_c sees it as a single input - it is a pseudo-bug in str_c
          mutate(value_text = str_c(usd(!! value_expr / units_val), units[[1]], sep = ""))
     
     if(fct_reorder){data_manipulated <- data_manipulated %>% 
          mutate(name = fct_reorder(name, (!! value_expr))) %>% 
          arrange(name)} # arrange() is used because fct_reorder() does not actually sort the data, arrange() does the sorting
     
     if(fct_rev){data_manipulated <- data_manipulated %>% 
          mutate(name = fct_rev(name)) %>% 
          arrange(name)}
          
# Visualization
# NOTE:  ggplot does not work with tidyeval framework - requires use of aes_string
     
     g <- data_manipulated %>% 
          ggplot(aes_string(x = value_name, y = "name")) +
          geom_segment(aes(xend = 0, yend = name), color = color) +
          geom_point(aes_string(size = value_name), color = color) +
          scale_x_continuous(label = scales::dollar) +
          theme_tq() +
          scale_size(range = c(4, 5)) + # scale_size() takes a range, not a name; c(3, 5) should work but I am not satisfied with the output.
          theme(legend.position = "none")
     
     if(include_lbl){
          g <-  g + geom_label(aes_string(label = "value_text", size = value_name), hjust = "inward", color = color)
     }
     return(g)
} 

A potential problem: an error can occur when trying to use this function on a grouped data set.

The warning thrown looks like this: mutate_impl(.data, dots) : binding character and factor vector, coercing into character vector. The solution is to add ungroup() before the factoring within the Data Manipulation step.

5.11 More Realistic Attrition Cost

Using estimated average salary and average productivity (revenue) by Job Role and Department, create a more realistic attrition cost estimate.

See homework_1_assignment.R for some relatively complex dplyr data code.

Here is a peek at the new Excel file that provides the source information.

## # A tibble: 11 x 4
##    Department          JobRole              Salary_Average Revenue_Average
##    <chr>               <chr>                         <dbl>           <dbl>
##  1 Human Resources     Human Resources               65000         104000 
##  2 Human Resources     Manager                       85000         153000 
##  3 Research & Develop~ Healthcare Represen~          45000          76500 
##  4 Research & Develop~ Laboratory Technici~          65000         110500 
##  5 Research & Develop~ Manager                       88000         158400 
##  6 Research & Develop~ Manufacturing Direc~         100000         220000.
##  7 Research & Develop~ Research Director            125000         312500 
##  8 Research & Develop~ Research Scientist            95000         171000 
##  9 Sales               Manager                       95000         190000 
## 10 Sales               Sales Executive              140000         308000 
## 11 Sales               Sales Representative          80000         160000

6 Data Understanding

This is the start of the 2nd phase of CRISP-DM - Data Understanding.

  1. Collect Data is complete - the Excel files have been provided.
  2. EDA to explore the critical properties of the underlying features and potential relationships to attrition.

This section introduces skimr - a tool for feature exploration by data type. GGally is also introduced to visualize feature interaction relationships using ggpairs.

May want to review the feature grouping introduced earlier. Group Features

6.1 EDA: Exploring Data By Data Type

6.1.1 Data Overview

skimr provides a good overview of the data, automatically separating character and numeric data. For character data, it is simple to see whether any features have more than perhaps 9 unique values, which might suggest binning. For numeric data, the histograms make it easy to see which variables might really be categorical.

## Skim summary statistics
##  n obs: 1250 
##  n variables: 35 
## 
## -- Variable type:character ---------------------------------------------------------------------------------------------
##        variable missing complete    n min max empty n_unique
##       Attrition       0     1250 1250   2   3     0        2
##  BusinessTravel       0     1250 1250  10  17     0        3
##      Department       0     1250 1250   5  22     0        3
##  EducationField       0     1250 1250   5  16     0        6
##          Gender       0     1250 1250   4   6     0        2
##         JobRole       0     1250 1250   7  25     0        9
##   MaritalStatus       0     1250 1250   6   8     0        3
##          Over18       0     1250 1250   1   1     0        1
##        OverTime       0     1250 1250   2   3     0        2
## 
## -- Variable type:numeric -----------------------------------------------------------------------------------------------
##                  variable missing complete    n     mean      sd   p0
##                       Age       0     1250 1250    36.95    9.08   18
##                 DailyRate       0     1250 1250   799.74  403.52  102
##          DistanceFromHome       0     1250 1250     9.1     8.1     1
##                 Education       0     1250 1250     2.91    1.03    1
##             EmployeeCount       0     1250 1250     1       0       1
##            EmployeeNumber       0     1250 1250  1022.71  600.42    1
##   EnvironmentSatisfaction       0     1250 1250     2.71    1.09    1
##                HourlyRate       0     1250 1250    65.82   20.14   30
##            JobInvolvement       0     1250 1250     2.75    0.71    1
##                  JobLevel       0     1250 1250     2.08    1.12    1
##           JobSatisfaction       0     1250 1250     2.73    1.11    1
##             MonthlyIncome       0     1250 1250  6569.4  4777.89 1009
##               MonthlyRate       0     1250 1250 14294.5  7071.61 2097
##        NumCompaniesWorked       0     1250 1250     2.71    2.49    0
##         PercentSalaryHike       0     1250 1250    15.23    3.69   11
##         PerformanceRating       0     1250 1250     3.16    0.36    3
##  RelationshipSatisfaction       0     1250 1250     2.71    1.09    1
##             StandardHours       0     1250 1250    80       0      80
##          StockOptionLevel       0     1250 1250     0.8     0.86    0
##         TotalWorkingYears       0     1250 1250    11.4     7.79    0
##     TrainingTimesLastYear       0     1250 1250     2.81    1.29    0
##           WorkLifeBalance       0     1250 1250     2.76    0.71    1
##            YearsAtCompany       0     1250 1250     7.09    6.21    0
##        YearsInCurrentRole       0     1250 1250     4.26    3.66    0
##   YearsSinceLastPromotion       0     1250 1250     2.21    3.2     0
##      YearsWithCurrManager       0     1250 1250     4.16    3.59    0
##      p25     p50      p75  p100
##    30       36      43       60
##   464      798    1156     1498
##     2        7      13.75    29
##     2        3       4        5
##     1        1       1        1
##   486.25  1020.5  1553.75  2068
##     2        3       4        4
##    48       66      83      100
##     2        3       3        4
##     1        2       3        5
##     2        3       4        4
##  2935.25  4903.5  8395    19999
##  8191.25 14328   20337.25 26999
##     1        2       4        9
##    12       14      18       25
##     3        3       3        4
##     2        3       4        4
##    80       80      80       80
##     0        1       1        3
##     6       10      15       40
##     2        3       3        6
##     2        3       3        4
##     3        5      10       40
##     2        3       7       18
##     0        1       3       15
##     2        3       7       17

(The hist sparkline column from skim is omitted above - the Unicode block characters did not render.)

6.1.2 Character Data

## Observations: 1,250
## Variables: 9
## $ Attrition      <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No...
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel_R...
## $ Department     <chr> "Sales", "Research & Development", "Research & ...
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "Lif...
## $ Gender         <chr> "Female", "Male", "Male", "Female", "Male", "Ma...
## $ JobRole        <chr> "Sales Executive", "Research Scientist", "Labor...
## $ MaritalStatus  <chr> "Single", "Married", "Single", "Married", "Marr...
## $ Over18         <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime       <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "...

Character data does not have levels, so use unique() instead. map() from purrr maps unique across each feature.
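A sketch of that call (assuming the training data lives in train_raw_tbl, as elsewhere in these notes):

```r
library(dplyr)
library(purrr)

train_raw_tbl %>%
     select_if(is.character) %>%   # keep only the character features
     map(unique)                   # list the distinct values of each
```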

## $Attrition
## [1] "Yes" "No" 
## 
## $BusinessTravel
## [1] "Travel_Rarely"     "Travel_Frequently" "Non-Travel"       
## 
## $Department
## [1] "Sales"                  "Research & Development"
## [3] "Human Resources"       
## 
## $EducationField
## [1] "Life Sciences"    "Other"            "Medical"         
## [4] "Marketing"        "Technical Degree" "Human Resources" 
## 
## $Gender
## [1] "Female" "Male"  
## 
## $JobRole
## [1] "Sales Executive"           "Research Scientist"       
## [3] "Laboratory Technician"     "Manufacturing Director"   
## [5] "Healthcare Representative" "Manager"                  
## [7] "Sales Representative"      "Research Director"        
## [9] "Human Resources"

This improves upon unique by including the counts for each character feature.
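A sketch of the counting version (same assumed data frame; tidyverse loaded):

```r
train_raw_tbl %>%
     select_if(is.character) %>%
     map(table)    # counts per level instead of just the distinct values
```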

## $Attrition
## 
##   No  Yes 
## 1049  201 
## 
## $BusinessTravel
## 
##        Non-Travel Travel_Frequently     Travel_Rarely 
##               124               239               887 
## 
## $Department
## 
##        Human Resources Research & Development                  Sales 
##                     49                    832                    369 
## 
## $EducationField
## 
##  Human Resources    Life Sciences        Marketing          Medical 
##               23              520              135              392 
##            Other Technical Degree 
##               66              114 
## 
## $Gender
## 
## Female   Male 
##    491    759 
## 
## $JobRole
## 
## Healthcare Representative           Human Resources 
##                       105                        39 
##     Laboratory Technician                   Manager 
##                       224                        89 
##    Manufacturing Director         Research Director 
##                       123                        73 
##        Research Scientist           Sales Executive 
##                       259                       273 
##      Sales Representative 
##                        65

Counts are nice but proportions are better! Below, an anonymous function is used. {#anonymous}

The ~ indicates the start of a formula, which purrr treats as an anonymous function

Also understand that train_raw_tbl %>% select_if(is.character) %>% map(~ table(.)) is the same as train_raw_tbl %>% select_if(is.character) %>% map(table).
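A sketch of the proportion version using the anonymous-function form (prop.table() is an assumption about how the displayed proportions were produced):

```r
train_raw_tbl %>%
     select_if(is.character) %>%
     map(~ table(.) %>% prop.table())   # ~ starts an anonymous function; . is each column
```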

## $Attrition
## .
##   No  Yes 
## 0.84 0.16 
## 
## $BusinessTravel
## .
##        Non-Travel Travel_Frequently     Travel_Rarely 
##             0.099             0.191             0.710 
## 
## $Department
## .
##        Human Resources Research & Development                  Sales 
##                  0.039                  0.666                  0.295 
## 
## $EducationField
## .
##  Human Resources    Life Sciences        Marketing          Medical 
##            0.018            0.416            0.108            0.314 
##            Other Technical Degree 
##            0.053            0.091 
## 
## $Gender
## .
## Female   Male 
##   0.39   0.61 
## 
## $JobRole
## .
## Healthcare Representative           Human Resources 
##                     0.084                     0.031 
##     Laboratory Technician                   Manager 
##                     0.179                     0.071 
##    Manufacturing Director         Research Director 
##                     0.098                     0.058 
##        Research Scientist           Sales Executive 
##                     0.207                     0.218 
##      Sales Representative 
##                     0.052 
## 
## $MaritalStatus
## .
## Divorced  Married   Single 
##     0.22     0.46     0.32 
## 
## $Over18
## .
## Y 
## 1 
## 
## $OverTime
## .
##   No  Yes 
## 0.72 0.28

6.1.3 Numeric Data

map_df provides a much cleaner output! Outputs a dataframe rather than a list.
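The output below counts the distinct values of each numeric feature; a sketch of a call that would produce it (assumed data frame train_raw_tbl):

```r
train_raw_tbl %>%
     select_if(is.numeric) %>%
     map_df(~ unique(.) %>% length())   # one-row data frame of distinct-value counts
```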

## # A tibble: 1 x 26
##     Age DailyRate DistanceFromHome Education EmployeeCount EmployeeNumber
##   <int>     <int>            <int>     <int>         <int>          <int>
## 1    43       808               29         5             1           1250
## # ... with 20 more variables: EnvironmentSatisfaction <int>,
## #   HourlyRate <int>, JobInvolvement <int>, JobLevel <int>,
## #   JobSatisfaction <int>, MonthlyIncome <int>, MonthlyRate <int>,
## #   NumCompaniesWorked <int>, PercentSalaryHike <int>,
## #   PerformanceRating <int>, RelationshipSatisfaction <int>,
## #   StandardHours <int>, StockOptionLevel <int>, TotalWorkingYears <int>,
## #   TrainingTimesLastYear <int>, WorkLifeBalance <int>,
## #   YearsAtCompany <int>, YearsInCurrentRole <int>,
## #   YearsSinceLastPromotion <int>, YearsWithCurrManager <int>

{#gather_example}

## # A tibble: 13 x 2
##    key                      value
##    <chr>                    <int>
##  1 EmployeeCount                1
##  2 StandardHours                1
##  3 PerformanceRating            2
##  4 EnvironmentSatisfaction      4
##  5 JobInvolvement               4
##  6 JobSatisfaction              4
##  7 RelationshipSatisfaction     4
##  8 StockOptionLevel             4
##  9 WorkLifeBalance              4
## 10 Education                    5
## 11 JobLevel                     5
## 12 TrainingTimesLastYear        7
## 13 NumCompaniesWorked          10

Numeric features that have fewer distinct values are likely to be discrete, and numeric variables with more are likely continuous. (Note above that the features where value = 1 may indicate non-essential or zero-variance variables.) Above are features that might be converted to factors. The table below shows the features that should remain continuous.
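A sketch of the gather step behind these tables (the filter threshold of 10 is an assumption read off the output):

```r
train_raw_tbl %>%
     select_if(is.numeric) %>%
     map_df(~ unique(.) %>% length()) %>%
     gather(key = "key", value = "value") %>%   # wide one-row df -> key-value pairs
     arrange(value) %>%
     filter(value <= 10)   # likely-discrete features; use value > 10 for the continuous table
```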

## # A tibble: 13 x 2
##    key                     value
##    <chr>                   <int>
##  1 PercentSalaryHike          15
##  2 YearsSinceLastPromotion    16
##  3 YearsWithCurrManager       18
##  4 YearsInCurrentRole         19
##  5 DistanceFromHome           29
##  6 YearsAtCompany             37
##  7 TotalWorkingYears          40
##  8 Age                        43
##  9 HourlyRate                 71
## 10 DailyRate                 808
## 11 MonthlyIncome            1161
## 12 MonthlyRate              1217
## 13 EmployeeNumber           1250

6.1.4 Visualizing The Feature-Target Interactions

To begin visualizing the data, select the descriptive variables (this serves as an example). There is a mix of character and continuous features.

## # A tibble: 1,250 x 7
##    Attrition   Age Gender MaritalStatus NumCompaniesWor~ Over18
##    <chr>     <dbl> <chr>  <chr>                    <dbl> <chr> 
##  1 Yes          41 Female Single                       8 Y     
##  2 No           49 Male   Married                      1 Y     
##  3 Yes          37 Male   Single                       6 Y     
##  4 No           33 Female Married                      1 Y     
##  5 No           27 Male   Married                      9 Y     
##  6 No           32 Male   Single                       0 Y     
##  7 No           59 Female Married                      4 Y     
##  8 No           30 Male   Divorced                     1 Y     
##  9 No           38 Male   Single                       0 Y     
## 10 No           36 Male   Married                      6 Y     
## # ... with 1,240 more rows, and 1 more variable: DistanceFromHome <dbl>

Apply the ggpairs function:

Customize the output to make it a bit easier to interpret:

  • Color by Attrition to make it stand out
  • Remove the lower triangle - it is not quite as helpful or understandable
  • Customize what is shown on the diagonal and change the alpha

6.1.5 Custom Function: plot_ggpairs

Take the visualization above and create a function.

  • color = NULL will output normal grey. However, it is best to select a feature to highlight (typically the label)
    • rlang::quo_is_null simply tests whether color_expr is NULL. Test by simply typing color_expr in the console
  • Because color_name is a string after using quo_name, aes_string must be used
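A sketch of the function described above (the diag settings and the density_alpha default are assumptions):

```r
library(dplyr)
library(ggplot2)
library(GGally)
library(rlang)

plot_ggpairs <- function(data, color = NULL, density_alpha = 0.5) {
     color_expr <- enquo(color)
     if (rlang::quo_is_null(color_expr)) {
          g <- data %>% ggpairs(lower = "blank")     # no highlight - plain grey
     } else {
          color_name <- quo_name(color_expr)         # string for aes_string()
          g <- data %>%
               ggpairs(mapping = aes_string(color = color_name),
                       lower = "blank", legend = 1,
                       diag = list(continuous = wrap("densityDiag",
                                                     alpha = density_alpha))) +
               theme(legend.position = "bottom")
     }
     return(g)
}
```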

6.1.6 Evaluate ggpairs Output

Evaluate the data features by category and understand the importance of the features using the new ggpairs function. How great is that?

  1. Descriptive Features: age, gender, marital status, etc

Observations

  • Age: Younger people seem to leave more than older workers.
  • DistanceFromHome appears skewed - those further away leave more frequently
  • NumCompaniesWorked appears more skewed - workers that have had more jobs leave more
  • MaritalStatus is interesting: for No the histogram bar heights go small-large-medium, whereas for Yes they go small-medium-large
  2. Employment Features

contains() is NOT case sensitive. tidyselect also provides an ends_with() option.
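A sketch of selecting the employment features this way (the column groupings are assumptions based on the observations below; plot_ggpairs is the custom function from section 6.1.5):

```r
train_raw_tbl %>%
     select(Attrition, contains("employee"), contains("department"),
            contains("job")) %>%        # contains() matches case-insensitively
     plot_ggpairs(color = Attrition)
```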

Observations

  • EmployeeCount and EmployeeNumber are not useful - they will not be used in the model
  • Department: A higher attrition proportion is evident in the far-right bar of the Department histogram
    • Recall there are only 3 departments in this data (HR, Research and Sales)
  • JobInvolvement: The density plot shows a higher spike at 3, and at 4 to a lesser extent - more likely to stay
  • JobLevel: The density plot shows deviation
  • JobRole: Some roles appear to have a higher attrition rate, judging by the histogram
  • JobSatisfaction: 1 has a higher density of attrition than the others
  3. Compensation Features

Observations

  • MonthlyIncome: Those that are leaving have a lower monthly income
  • PercentSalaryHike: It’s difficult to deduce anything based on the visualization
  • StockOptionLevel: Those that are staying have a higher stock option level
  4. Survey Results

Observations

  • EnvironmentSatisfaction: A higher proportion of those leaving have a low environment satisfaction level
  • WorkLifeBalance: Those that are staying have a higher density of 2’s and 3’s
  5. Performance Data

Observations

  • JobInvolvement: Those that are leaving have a lower density of 3’s and 4’s
  6. Work-Life Features

Observations

  • Overtime: The proportion of those leaving that are working Over Time are high compared to those that are not leaving
  7. Training and Education

Observations

  • TrainingTimesLastYear: People that leave tend to have less annual training
  8. Time-Based Features

Observations

  • YearsAtCompany: People that leave tend to have less working years at the company
  • YearsSinceLastPromotion: It’s difficult to deduce anything based on the visualization

7 Data Preparation

There are two goals for this section:

  1. How does the data have to be formatted for people to visualize and understand?
  • Prepare a preprocessing data pipeline for people.
  2. How does the data have to be formatted for machine learning algorithms to process, and for other types of analysis like correlation analysis?
  • Use recipes to build machine learning preprocessing templates.

7.1 Data Preparation For People

The telco-train.xlsx and telco_definitions.xlsx files must be merged to present the data in a meaningful way - that is the purpose of this processing pipeline.

Look at an example:

The plot above is not very readable.

  • The Education numbers on the x-axis do not mean anything to most people.

The human readable pipeline will correct this and other pieces to make the plots more user-friendly.

7.1.1 Tidying The Data

Explore the data definitions file - note that many of the features are coded: numbers are used as stand-ins for the descriptions of categorical values.

## # A tibble: 35 x 2
##    X__1                    X__2             
##    <chr>                   <chr>            
##  1 Education               1 'Below College'
##  2 <NA>                    2 'College'      
##  3 <NA>                    3 'Bachelor'     
##  4 <NA>                    4 'Master'       
##  5 <NA>                    5 'Doctor'       
##  6 <NA>                    <NA>             
##  7 EnvironmentSatisfaction 1 'Low'          
##  8 <NA>                    2 'Medium'       
##  9 <NA>                    3 'High'         
## 10 <NA>                    4 'Very High'    
## # ... with 25 more rows

X__1 = Feature Name; X__2 = Feature Code + Feature Description

Tidying this data will allow us to replace numbers with something more meaningful using tidyverse functions.

  • fill replaces missing values (NAs) with the closest entry (the previous one if .direction = "down" or the next if .direction = "up")
    • For example, this will fill NAs in Education under X__1 with Education - very cool!
  • Remove NAs under X__2
  • Break X__2 into 2 columns: one with the number, the other with the description
    • separate is from tidyr
    • remove simply removes the original X__2 column
    • Need to start using this!
  • Rename columns to make them more friendly - X__1 does not mean anything!
  • Remove the trailing tick mark from the new value column
    • str_replace replaces a matched pattern in a string with a replacement argument
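The steps above can be sketched as a single pipeline (the raw-table name definitions_raw_tbl and the separate() arguments are assumptions):

```r
library(tidyverse)

definitions_tbl <- definitions_raw_tbl %>%
     fill(X__1, .direction = "down") %>%               # fill feature-name NAs downward
     filter(!is.na(X__2)) %>%                          # drop the blank separator rows
     separate(X__2, into = c("key", "value"),          # split "1 'Below College'"
              sep = " ", extra = "merge", remove = TRUE) %>%
     rename(column_name = X__1) %>%                    # friendlier name than X__1
     mutate(key = as.numeric(key)) %>%
     mutate(value = value %>%
                 str_replace_all(pattern = "'", replacement = ""))  # strip tick marks
```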
## # A tibble: 29 x 3
##    column_name               key value        
##    <chr>                   <dbl> <chr>        
##  1 Education                   1 Below College
##  2 Education                   2 College      
##  3 Education                   3 Bachelor     
##  4 Education                   4 Master       
##  5 Education                   5 Doctor       
##  6 EnvironmentSatisfaction     1 Low          
##  7 EnvironmentSatisfaction     2 Medium       
##  8 EnvironmentSatisfaction     3 High         
##  9 EnvironmentSatisfaction     4 Very High    
## 10 JobInvolvement              1 Low          
## # ... with 19 more rows

7.1.2 Mapping Over Lists

definitions_tbl has multiple datasets in one dataframe. They will need to be integrated separately into the training and testing datasets. This will make more sense in a bit. This process starts by making definitions_tbl into a list.

This concept is new - read and study carefully!

  • split splits a dataframe into multiple dataframes contained within a list. Supply a column name as a vector (meaning .$column_name). Therefore, Education, EnvironmentSatisfaction, JobInvolvement and so on will be separated into individual dataframes. Example:
definitions_list[[3]]
# A tibble: 4 x 3
  column_name      key value    
  <chr>          <dbl> <chr>    
1 JobInvolvement     1 Low      
2 JobInvolvement     2 Medium   
3 JobInvolvement     3 High     
4 JobInvolvement     4 Very High
  • Use map and an anonymous function (the ~) to remove column_name from each of the dataframes stored in the newly created list
    • Example:
    > definitions_list[[3]]
    # A tibble: 4 x 2
      key value    
    <dbl> <chr>    
    1     1 Low      
    2     2 Medium   
    3     3 High     
    4     4 Very High
  • Use map again with an anonymous function to convert each entry in the value column into a factor
## # A tibble: 4 x 2
##     key value    
##   <dbl> <fct>    
## 1     1 Low      
## 2     2 Medium   
## 3     3 High     
## 4     4 Very High

IMPORTANT: forcats::as_factor is different from base::as.factor. The former creates factor in the order in which they appear as opposed to alphabetically with base::as.factor
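The three steps above can be sketched as (assuming the tidied definitions live in definitions_tbl; tidyverse loaded):

```r
definitions_list <- definitions_tbl %>%
     split(.$column_name) %>%                       # one data frame per feature
     map(~ select(., -column_name)) %>%             # drop the now-redundant name column
     map(~ mutate(., value = as_factor(value)))     # forcats::as_factor keeps appearance order
```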

  • Rename the names from key and value to something like Education and Education_value
    • first get the name of the dataframe in the list and assign it the variable list_name
    • Then set the column names using list_name and making the value column into list_name pasted with _value - really not that hard!
## # A tibble: 4 x 2
##   JobInvolvement JobInvolvement_value
##            <dbl> <fct>               
## 1              1 Low                 
## 2              2 Medium              
## 3              3 High                
## 4              4 Very High
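A sketch of that renaming step as a loop over the list (this exact form is an assumption):

```r
for (i in seq_along(definitions_list)) {
     list_name <- names(definitions_list)[i]          # e.g. "JobInvolvement"
     colnames(definitions_list[[i]]) <- c(list_name,
                                          paste0(list_name, "_value"))
}
```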

This can now be merged with the training dataset!

Iteratively join the dataframes within the definitions list with the main dataframe (training data)

This is new - study and learn!

  • Make the train_raw_tbl a list and call it HRData - weird, huh!
    • make a list but the original training DF is the only entry called HRData
  • append adds elements to a vector or list
    • The list grows because the first element is the training DF followed by all the elements in the definitions list
    • everything is in one list - helpful to use lists to collect objects that need to be iterated over with purrr
    • recall we are doing this so values are replaced by descriptions in the training data
  • Use reduce which iteratively applies a user specified function to successive binary sets of objects. A 3-element vector would have a function applied to the first 2 elements and that output would then have the function applied with the 3rd element
    • here it adds columns like EducationField joined on the Education numerical value - human readable!
  • Remove the original training data field names, like Education since we have just joined the more descriptive fields like Education_value
    • Using negative one_of to remove them
  • Rename the remaining fields like Education_value to a more useful nomenclature
    • using set_names together with str_replace_all to remove _value
  • Resort the dataframe
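Pulling the bullets together, the merge can be sketched as (the element name HRData and the final alphabetical sort are taken from the description above; tidyverse loaded):

```r
data_merged_tbl <- list(HRData = train_raw_tbl) %>%
     append(definitions_list, after = 1) %>%          # training data first, then definitions
     reduce(left_join) %>%                            # iteratively join on the shared key columns
     select(-one_of(names(definitions_list))) %>%     # drop the coded originals (Education, ...)
     set_names(str_replace_all(names(.), pattern = "_value",
                               replacement = "")) %>% # Education_value -> Education
     select(sort(names(.)))                           # resort columns alphabetically
```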
## Observations: 1,250
## Variables: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No",...
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently",...
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department               <chr> "Sales", "Research & Development", "R...
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <fct> College, Below College, College, Mast...
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Ot...
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction  <fct> Medium, High, Very High, Very High, L...
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "...
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <fct> High, Medium, Medium, High, High, Hig...
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole                  <chr> "Sales Executive", "Research Scientis...
## $ JobSatisfaction          <fct> Very High, Medium, High, High, Medium...
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Marri...
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No"...
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <fct> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <fct> Low, Very High, Medium, High, Very Hi...
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <fct> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...

7.1.3 Factoring Character Data

There are still some character variables:

## Observations: 1,250
## Variables: 9
## $ Attrition      <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "No...
## $ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "Travel_R...
## $ Department     <chr> "Sales", "Research & Development", "Research & ...
## $ EducationField <chr> "Life Sciences", "Life Sciences", "Other", "Lif...
## $ Gender         <chr> "Female", "Male", "Male", "Female", "Male", "Ma...
## $ JobRole        <chr> "Sales Executive", "Research Scientist", "Labor...
## $ MaritalStatus  <chr> "Single", "Married", "Single", "Married", "Marr...
## $ Over18         <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y...
## $ OverTime       <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes", "...

These need to be transformed into factors. Evaluate whether the features are ordered.

## # A tibble: 3 x 1
##   BusinessTravel   
##   <chr>            
## 1 Travel_Rarely    
## 2 Travel_Frequently
## 3 Non-Travel

The natural order should be: Non-Travel –> Travel_Rarely –> Travel_Frequently. Change it!
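The re-leveling can be sketched with forcats (assuming the data lives in train_readable_tbl, the tibble used throughout this section; MaritalStatus gets the same treatment):

```r
library(tidyverse)

# fct_relevel moves the named levels to the front, in the order given
train_readable_tbl <- train_readable_tbl %>%
    mutate(
        BusinessTravel = BusinessTravel %>%
            as_factor() %>%
            fct_relevel("Non-Travel", "Travel_Rarely", "Travel_Frequently"),
        MaritalStatus = MaritalStatus %>%
            as_factor() %>%
            fct_relevel("Single", "Married", "Divorced")
    )

levels(train_readable_tbl$BusinessTravel)
```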

Evaluate the levels of the factors:

## $Attrition
## [1] "No"  "Yes"
## 
## $BusinessTravel
## [1] "Non-Travel"        "Travel_Frequently" "Travel_Rarely"    
## 
## $Department
## [1] "Human Resources"        "Research & Development"
## [3] "Sales"                 
## 
## $Education
## [1] "Below College" "College"       "Bachelor"      "Master"       
## [5] "Doctor"       
## 
## $EducationField
## [1] "Human Resources"  "Life Sciences"    "Marketing"       
## [4] "Medical"          "Other"            "Technical Degree"
## 
## $EnvironmentSatisfaction
## [1] "Low"       "Medium"    "High"      "Very High"

BusinessTravel and MaritalStatus are the only factors that need to be re-leveled.

## $BusinessTravel
## [1] "Non-Travel"        "Travel_Rarely"     "Travel_Frequently"
## 
## $MaritalStatus
## [1] "Single"   "Married"  "Divorced"

7.3 Data Preparation With Recipes!

This chapter is about preparing data for correlation analysis. Correlation Analysis is a great way to determine if you are getting good features prior to modeling. Always start with a correlation analysis.

7.3.1 Custom Function: plot_hist_facet

Use this function to inspect feature distributions and identify the transformations needed to make them properly formatted for correlation analysis.

  • palette()[[3]] is just a hex code (“#18BC9C”)
  • as.factor orders factor levels alphabetically (as_factor maintains the order in which they appear)
  • using as.factor and then as.numeric covers cases for characters and/or factors - nothing special here
  • gather reshapes a wide df into a long df
  • fct_rev simply reverses the order of the factor levels
  • facet_wrap splits the data into multiple graphs by a categorical column - or set of columns. Use the formula format to specify facets (~key)
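A minimal sketch of plot_hist_facet following the bullets above (the argument names bins, ncol, fill, color, and scale are assumptions):

```r
library(tidyverse)

plot_hist_facet <- function(data, bins = 10, ncol = 5,
                            fill = palette()[[3]], color = "white",
                            scale = "free") {

    data %>%
        # coerce characters/factors to numeric so everything can be histogrammed
        mutate_if(is.character, as.factor) %>%
        mutate_if(is.factor, as.numeric) %>%
        # wide to long: one key/value pair per feature; keep appearance order
        gather(key = key, value = value, factor_key = TRUE) %>%
        ggplot(aes(x = value, group = key)) +
        geom_histogram(bins = bins, fill = fill, color = color) +
        facet_wrap(~ key, ncol = ncol, scale = scale)
}

train_readable_tbl %>%
    select(Attrition, everything()) %>%  # put the target label first
    plot_hist_facet(bins = 10, ncol = 5)
```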

Here is the output as an example:

It is useful to make the target label the first plot (e.g. by selecting Attrition first).

7.3.2 recipes: Preprocessing Data For Machines

The idea of the recipes package is to define a recipe or blueprint that can be used to sequentially define the encodings and preprocessing of the data (i.e. feature engineering).

In recipes, there are no constraints related to the order in which steps are added to the recipe. However, there are some general suggestions that you should consider:

  • If using a Box-Cox transformation, don’t center the data first or do any operations that might make the data non-positive. Alternatively, use the Yeo-Johnson transformation so you don’t have to worry about this.
  • Recipes do not automatically create dummy variables (unlike most formula methods). If you want to center, scale, or do any other operations on all of the predictors, run step_dummy first so that numeric columns are in the data set instead of factors.
  • You should make dummy variables before creating the interactions.
  • If you are lumping infrequently occurring categories together with step_other, call step_other before step_dummy.

Here is a suggested order of potential steps that should work for most problems:

  1. Impute
  2. Individual transformations
    • changes to the data to remove skew (e.g. log), stabilize variance (e.g. Box Cox) or make stationary (e.g. difference for time series)
    • Normality is required for linear models that depend on correlation (correlation analysis, linear and logistic regression)
    • Non-linear models like random forest handle non-linear data because they depend on how the data can be segregated
  3. Discretize (if needed and if you have no other choice)
    • discretization can hurt correlations. Often best not to discretize unless there is a specific need to do so
      • Example: Explain the difference between cohorts like Millennials vs Generation X
  4. Create dummy variables
  5. Create interactions
    • When 2 features have a relationship to each other, they are said to interact. Example: The ratio between height and weight of a person is related to obesity. Height and weight interact.
    • Some advanced modeling - deep learning - detect interactions automatically
  6. Normalization steps (center, scale, range, etc.). Prefer to make this #4, before creating dummy variables
    • Some ML algorithms (PCA, KNN, Deep Learning) require scaled data
  7. Multivariate transformation (e.g. PCA, spatial sign, etc)
    • Useful when data is wide and can be susceptible to over-fitting

For our data, here is the recipe plan we will implement:

  • Plan: Correlation Analysis
    • Always do correlation analysis before modeling
    • Impute/Zero Variance
      • Imputation: The act of filling in missing values within features.
      • Common methods are:
        • filling in by recency - tidyr: fill
        • filling by similarity - knn impute
    • Transformations
    • Center and scale Dummy Data

We do not have missing values in the current data. However there are features with no variance.

recipes: Setup

First create a recipe from the original data and then specify the processing steps. Recipes can be created manually by sequentially adding roles to variables in a data set. If the analysis only required outcomes and predictors, the easiest way to create the initial recipe is to use the standard formula method:
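With the formula method, the setup is a one-liner (recipe_obj matches the object name used later in this section):

```r
library(recipes)

recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl)
recipe_obj
```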

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34

The data contained in the data argument need not be the training set; this data is only used to catalog the names of the variables and their types (e.g. numeric, etc.).

recipes: Zero Variance Features

To add a step to remove zero variance features, it is simple to do:
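A sketch of adding the zero variance filter to the recipe:

```r
recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl) %>%
    step_zv(all_predictors())  # drop features with zero variance

recipe_obj
```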

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34
## 
## Operations:
## 
## Zero variance filter on all_predictors()

prep() prepares the recipe by performing preliminary calculations. Useful to see what each step is doing.

  • prep() does not transform the data - it just estimates whatever is needed to perform the transformations
  • bake() performs the transformations on new data (even if it is the same data you have already used) - bake takes as its newdata argument the name of the dataset, typically the one used in the setup: recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl). Useful to apply the recipe to any data set (commonly test and train)
  • Essentially 3 steps:
    1. Create the instructions with the recipe and steps
    2. Prepare the recipe (prep())
    3. Bake the recipe (`bake(newdata = testdataname)`)
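The three steps can be sketched as follows (using the older recipes interface from this document - newer versions of recipes call bake's data argument new_data instead of newdata; test_readable_tbl is a hypothetical hold-out set):

```r
# 1. Create the instructions
recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl) %>%
    step_zv(all_predictors())

# 2. Prepare the recipe - estimates what each step needs (e.g. which
#    features have zero variance) without transforming anything
prepared_recipe <- prep(recipe_obj)

# 3. Bake the recipe - applies the trained steps to any data set
train_tbl <- bake(prepared_recipe, newdata = train_readable_tbl)
test_tbl  <- bake(prepared_recipe, newdata = test_readable_tbl)
```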
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34
## 
## Training data contained 1250 data points and no missing data.
## 
## Operations:
## 
## Zero variance filter removed EmployeeCount, ... [trained]

It is helpful to know 3 features will be removed when the recipe is implemented.

recipes: Transformations

Identify which features have high skewness. You could use the facet plot illustrated above or compute skewness with code (helpful when you have many features):

High (positive) value –> fat tail on the right | Low (negative) value –> fat tail on the left

skewness is a function from PerformanceAnalytics (the package is loaded with tidyquant) that computes the skewness of a univariate distribution.

factor_key = TRUE preserves the order of the variables
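Computing the table below can be sketched as (skewed_feature_tbl is an assumed name):

```r
library(tidyverse)
library(tidyquant)  # loads PerformanceAnalytics, which provides skewness()

skewed_feature_tbl <- train_readable_tbl %>%
    select_if(is.numeric) %>%
    map_df(skewness) %>%             # skewness of each numeric column
    gather(factor_key = TRUE) %>%    # wide to long; factor_key keeps the order
    arrange(desc(value))

skewed_feature_tbl
```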

## # A tibble: 19 x 2
##    key                          value
##    <fct>                        <dbl>
##  1 YearsSinceLastPromotion    1.94   
##  2 YearsAtCompany             1.76   
##  3 MonthlyIncome              1.35   
##  4 TotalWorkingYears          1.11   
##  5 NumCompaniesWorked         1.02   
##  6 JobLevel                   1.02   
##  7 DistanceFromHome           0.985  
##  8 StockOptionLevel           0.968  
##  9 YearsInCurrentRole         0.933  
## 10 PercentSalaryHike          0.802  
## 11 YearsWithCurrManager       0.802  
## 12 TrainingTimesLastYear      0.557  
## 13 Age                        0.422  
## 14 EmployeeNumber             0.0121 
## 15 MonthlyRate                0.0100 
## 16 DailyRate                  0.00578
## 17 HourlyRate                -0.0263 
## 18 EmployeeCount            NaN      
## 19 StandardHours            NaN

Judgment is used to determine a cutoff for what is/is not skewed. Evaluate the values from above:

  • There appears to be a big difference between YearsWithCurrManager = 0.802 and TrainingTimesLastYear = 0.557

The facet plot indicates this too. as.character is used so the skewed feature names become a plain character vector for plotting:
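Extracting those names might look like this (a sketch assuming the skewness table above is stored as skewed_feature_tbl; the 0.8 cutoff reflects the judgment call above):

```r
skewed_feature_names <- skewed_feature_tbl %>%
    filter(value >= 0.8) %>%   # keep features above the skewness cutoff
    pull(key) %>%
    as.character()             # character vector for plotting / selection

skewed_feature_names
```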

##  [1] "YearsSinceLastPromotion" "YearsAtCompany"         
##  [3] "MonthlyIncome"           "TotalWorkingYears"      
##  [5] "NumCompaniesWorked"      "JobLevel"               
##  [7] "DistanceFromHome"        "StockOptionLevel"       
##  [9] "YearsInCurrentRole"      "PercentSalaryHike"      
## [11] "YearsWithCurrManager"

With this information, plot the features:

The plot shows JobLevel and StockOptionLevel are factors and therefore are not transformed here (wait for dummying)

We now have the information we need to continue building our recipe:

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34
## 
## Training data contained 1250 data points and no missing data.
## 
## Operations:
## 
## Zero variance filter removed EmployeeCount, ... [trained]
## Yeo-Johnson transformation on 9 items [trained]
## Factor variables from JobLevel, StockOptionLevel [trained]

step_YeoJohnson - a power transformation. Instead of a simple sqrt, it searches for the power between -5 and 5 that best eliminates skew
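Adding the transformation steps could look like this (a sketch; newer versions of recipes expect tidyselect helpers such as all_of() around character vectors, and a levels argument for step_num2factor):

```r
factor_names <- c("JobLevel", "StockOptionLevel")

# remove the factor-like features from the skewed list (wait for dummying)
skewed_feature_names <- setdiff(skewed_feature_names, factor_names)

recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl) %>%
    step_zv(all_predictors()) %>%
    step_YeoJohnson(skewed_feature_names) %>%
    step_num2factor(factor_names)

recipe_obj %>% prep()
```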

To make sure the recipe worked, facet plot the numeric features that were the target of the YeoJohnson transformation:

Skewness is significantly reduced!

recipes: center & scale

Required for algorithms that need scaled data, including k-means, Deep Learning, PCA, and SVMs.

Scaling does not typically hurt algos - so do it

Note the x-axis values differ from one another - different ranges. DailyRate would likely dominate!

Must do in order: Center then Scale (C before S)

Note - H2O does not require centering and scaling - more on this later.
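The recipe through centering and scaling can be sketched as (center before scale):

```r
recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl) %>%
    step_zv(all_predictors()) %>%
    step_YeoJohnson(skewed_feature_names) %>%
    step_num2factor(factor_names) %>%
    step_center(all_numeric()) %>%   # center first ...
    step_scale(all_numeric())        # ... then scale

recipe_obj  # untrained - prints the planned operations
```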

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34
## 
## Operations:
## 
## Zero variance filter on all_predictors()
## Yeo-Johnson transformation on skewed_feature_names
## Factor variables from factor_names
## Centering for all_numeric()
## Scaling for all_numeric()

The output from the recipe object is helpful but we are interested in understanding what it is really doing behind the scenes:

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34
## 
## Training data contained 1250 data points and no missing data.
## 
## Operations:
## 
## Zero variance filter removed EmployeeCount, ... [trained]
## Yeo-Johnson transformation on 9 items [trained]
## Factor variables from JobLevel, StockOptionLevel [trained]
## Centering for Age, DailyRate, ... [trained]
## Scaling for Age, DailyRate, ... [trained]

All the x-axis scales are updated for the independent features!

recipes: dummy variables

One hot encoding - expands categorical variables into separate columns of 1/0. Important to detect patterns in unordered data.

all_nominal() selects only categorical/factor data

To see what output we will get, evaluate only JobRole as our example:

JobRole has been expanded to 8 columns (1 less than the number of unique values in JobRole; the first level, Healthcare Representative, was left out).

recipes::step_dummy was updated with a new option - one_hot: a logical. For C levels, should C dummy variables be created rather than C - 1? Very useful for plotting!
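Adding the dummy step to the existing recipe object (a sketch):

```r
recipe_obj <- recipe_obj %>%
    step_dummy(all_nominal())
    # or: step_dummy(all_nominal(), one_hot = TRUE) to keep all C levels
```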

recipes: Final Recipe

Transform the data... but first, review what the recipe will do:

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         34
## 
## Training data contained 1250 data points and no missing data.
## 
## Operations:
## 
## Zero variance filter removed EmployeeCount, ... [trained]
## Yeo-Johnson transformation on 9 items [trained]
## Factor variables from JobLevel, StockOptionLevel [trained]
## Centering for Age, DailyRate, ... [trained]
## Scaling for Age, DailyRate, ... [trained]
## Dummy variables from BusinessTravel, Department, ... [trained]

All of this effort is so we can perform a correlation analysis.

Note that the recipe changes our target label:

## [1] "numeric"

7.4 Correlation Analysis

Note: A good introduction is found here. It may also help to expand this to include non-linear relationships.

Correlation analysis only identifies linear relationships - if there is an exponential or otherwise non-linear relationship, correlation can understate or miss it.

(A random forest will find the non-linear relationship.)

If there are no features with relatively strong correlation, it might be wise to collect more data.

Correlation is a good barometer but not necessarily a definitive answer - it is a guide.

7.4.1 get_cor

A function to take data and measure correlation.

  • pairwise.complete.obs is not the default but it is what you nearly always want to use
    • If you use the default - “everything” - there is a risk of getting missing value errors
  • fct_reorder reorders factor by another column. Changes the levels of a factor but not the DF - use arrange for that
  • correlation analysis only works with numeric data
  • quo_name changes the stored value to “Attrition_Yes” from a quosure
  • mutate_if(is.character, as.factor) %>% mutate_if(is.factor, as.numeric) prevents the function from failing if there is character data
  • cor from stats returns a square correlation df correlating every feature against all the others. The return is square, meaning the number of rows = the number of columns
    • use is an optional character string giving a method for computing covariances in the presence of missing values.
      • This must be (an abbreviation of) one of the strings “everything”, “all.obs”, “complete.obs”, “na.or.complete”, or “pairwise.complete.obs”
      • If use is “everything”, NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA.
      • If use is “all.obs”, then the presence of missing observations will produce an error. If use is “complete.obs” then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error).
      • “na.or.complete” is the same unless there are no complete cases, that gives NA.
      • If use has the value “pairwise.complete.obs” then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables.
  • The (.) in names(.) simply represents the data in the pipe. This creates a new character column called feature from the column names of the square correlation matrix.
  • select(feature, !! feature_expr) produces a tibble with 2 columns - the new feature column and the Attrition_Yes values
  • filter(!(feature == feature_name)) simply removes the row in the feature column that equals Attrition_Yes - it is redundant and not needed
  • Recall as_factor maintains the order in which the values appear - not alphabetized
  • fct_reorder reorders a factor by another column. In this case, reorder by the Attrition_Yes value. Helpful! Requires arrange() to update the df.
  • fct_rev reverses the ordering from the previous line of code

fct_reorder and fct_rev are useful for plotting
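A sketch of get_cor assembled from the bullets above (the fct_reorder/fct_rev flag arguments are assumptions):

```r
library(tidyverse)

get_cor <- function(data, target, use = "pairwise.complete.obs",
                    fct_reorder = FALSE, fct_rev = FALSE) {

    feature_expr <- enquo(target)
    feature_name <- quo_name(feature_expr)

    data_cor <- data %>%
        mutate_if(is.character, as.factor) %>%   # guard against character data
        mutate_if(is.factor, as.numeric) %>%
        cor(use = use) %>%                       # square correlation matrix
        as_tibble() %>%
        mutate(feature = names(.)) %>%           # (.) is the data in the pipe
        select(feature, !! feature_expr) %>%
        filter(!(feature == feature_name)) %>%   # drop the redundant self-correlation
        mutate_if(is.character, as_factor)       # keep appearance order

    if (fct_reorder) {
        data_cor <- data_cor %>%
            mutate(feature = fct_reorder(feature, !! feature_expr)) %>%
            arrange(feature)
    }

    if (fct_rev) {
        data_cor <- data_cor %>%
            mutate(feature = fct_rev(feature)) %>%
            arrange(feature)
    }

    data_cor
}
```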

Example of get_cor()

## Warning in stats::cor(x, ...): the standard deviation is zero
## # A tibble: 65 x 2
##    feature                          Attrition_Yes
##    <fct>                                    <dbl>
##  1 PerformanceRating_Good                 NA     
##  2 OverTime_Yes                            0.240 
##  3 JobRole_Sales.Representative            0.153 
##  4 BusinessTravel_Travel_Frequently        0.103 
##  5 Department_Sales                        0.0891
##  6 DistanceFromHome                        0.0822
##  7 JobRole_Laboratory.Technician           0.0737
##  8 EducationField_Technical.Degree         0.0731
##  9 JobRole_Human.Resources                 0.0718
## 10 JobInvolvement_Medium                   0.0604
## # ... with 55 more rows
  • Good correlation with real-world data could be +/-0.1 or greater. It just depends on the problem. Real-world data is very often messy, but if you get one good feature or a few moderately good features (e.g. 0.2 or more) then this could be good.
  • The goal of the preprocessing is to help improve the correlation even if the relationship is nonlinear. For example, that is why we applied Yeo-Johnson to some of the skewed data. However, there may be other ways. We could definitely run a random forest, which will pick up any non-linearity, and then try to get feature importance.

7.4.2 Note on Multicollinearity

Multicollinearity is when two features in a model are highly correlated. It matters because it relates to whether or not the coefficients are uniquely identified, which affects the explanatory power of the model.

The goals are different for machine learning and traditional statistics, which is why multicollinearity is not traditionally checked for in ML but is in statistics.

Machine Learning: Assessing multicollinearity is not typically a concern in machine learning, where the goal is to get a high performance prediction and more features help in this pursuit. ML models can be linear or nonlinear and typically have regularization, which fights overfitting - a situation that can be caused by many correlated features. This enables extreme accuracy, but the feature weights (coefficients) get lost, which reduces the explainability of the model.

In machine learning, we will have hundreds of features, but a few of the features will likely dominate for any given observation. This is where LIME comes in. LIME is a tool that allows us to cut through to the core of the features that are driving the explanation for any given model. It approximates nonlinear models locally, which enables us to detect which features are driving the prediction. It’s not impervious to colinear features but this is why we need to understand what features are going into a model from a business-intuition standpoint and discuss features with Subject Matter Experts. See this resource for more discussion on the LIME topic: https://github.com/marcotcr/lime/issues/178

Traditional Statistics: In traditional statistics it’s very common to inspect for multicollinearity and remove features. This is because the goal is to build a model using few features that explain the majority of the variance. When removing features, we can assess multicollinearity.

The Debate: Because of the two sides, it’s debatable and it depends on your goals of high performance vs ability to explain with as few features as possible.

My Opinion: When it comes to modeling, my opinion is that we should strive for a high performing model once we know we have good features (features that have some predictive power based on a correlation assessment). When it comes to selecting features, we should meet with Subject Matter Experts and pick reasonable features to build a model around, doing our best not to include the same feature twice, which is often where multicollinearity arises (e.g. including birth year and age in the same model). When it comes to explaining, we should use tools like LIME that can hone in on the features contributing most to the prediction.

For more on this topic: https://stats.stackexchange.com/questions/168622/why-is-multicollinearity-not-checked-in-modern-statistics-machine-learning

7.4.3 plot_cor

The data produced by get_cor() is terrific, but understanding the data quickly is accomplished by creating a visualization.

The function below produces an informative visualization of the get_cor() output. Do not be overwhelmed by the function. Many of the parameters are simply ggplot parameters.

  • The beginning of the function looks much like get_cor()
  • feature_name_text is simply the Attrition_Yes value rounded to the number of digits specified by lbl_precision
  • New feature named Correlation is created based on the value of Attrition_Yes and outputs a factor = Positive or Negative
  • Experiment with the ggplot code to understand how it works. (First set values for the function parameters.)
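A sketch of plot_cor in the same spirit (all parameter names and defaults here are assumptions; it builds on get_cor()):

```r
plot_cor <- function(data, target, fct_reorder = FALSE, fct_rev = FALSE,
                     include_lbl = TRUE, lbl_precision = 2,
                     size = 2, line_size = 1) {

    feature_expr <- enquo(target)

    data_cor <- data %>%
        get_cor(!! feature_expr, fct_reorder = fct_reorder, fct_rev = fct_rev) %>%
        mutate(feature_name_text = round(!! feature_expr, lbl_precision)) %>%
        # Positive/Negative factor drives the color aesthetic
        mutate(Correlation = case_when(
            (!! feature_expr) >= 0 ~ "Positive",
            TRUE                   ~ "Negative") %>% as.factor())

    g <- data_cor %>%
        ggplot(aes(x = !! feature_expr, y = feature, group = feature)) +
        geom_point(aes(color = Correlation), size = size) +
        geom_segment(aes(xend = 0, yend = feature, color = Correlation),
                     size = line_size) +
        geom_vline(xintercept = 0, color = "black", size = line_size)

    if (include_lbl) {
        g <- g + geom_label(aes(label = feature_name_text), hjust = "outward")
    }

    g
}
```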

7.4.4 Evaluating plot_cor Output

Data have been prepared and we developed a visualization function to easily evaluate feature correlations.

  1. Descriptive Features: age, gender, marital status, etc

Observations:

  • As age increases, less likely to leave company
    • DistanceFromHome has a relationship: greater distance from home makes someone more likely to leave the company
  2. Employment Features

Observations:

  • Certain jobs either support or contradict the likelihood to leave
  3. Compensation Features

Observations:

  • MonthlyIncome and StockOptionLevel are impactful

  4. Survey Results

  5. Performance Data

  6. Work-Life Features

  7. Training and Education

  8. Time-Based Features

8 Automated Machine Learning With H2O

A few random H2O notes:

  • H2O automatically centers and scales the data. Most transformations are unnecessary.
  • The dummy variables are handled internally by H2O provided that the format supplied is a factor.
  • H2O may not complain, but LIME certainly will. My rule on missing data (and outliers) - Don’t ever leave missing data (or outliers) to chance, handle it yourself.
  • Recommend doing feature engineering yourself because it’s usually better to apply some business knowledge rather than leave it up to an algorithm that supposedly handles feature engineering.
  • AutoML performs cross validation and grid search automatically! (We will learn about it anyway because you must know this as a data scientist!)

H2O has great documentation.

8.1 H2O Modeling

Update Since Original Modeling: The leaderboard_frame is a legacy argument that is not necessary, and the data is better served being used for training and/or validation. Therefore, you should only perform one split (training and validation) as opposed to two (training, validation, test - used for leaderboard_frame). This should increase model performance on the data set. You can try it both ways to see what happens. The leaderboard rankings performed later will be made on the Cross Validation metrics.

The main issue is that using the leaderboard reduces the amount of training data unnecessarily. The leaderboard frame used to be used for creating the Stacked Ensemble models. Now it is a legacy argument because cross validation is used as the ranking mechanism. So if we split off a set for the leaderboard, we lose data that could have been used for training. Instead of doing 70% training, 15% validation (test), and 15% leaderboard (test), we can combine the leaderboard and training sets to get a higher accuracy model and rank by Cross Validation, which is performed on the training set internally within h2o.automl(). The split should be 85% training and 15% validation. Leave leaderboard_frame = NULL, which is the preferred approach now.

  • max_runtime_secs = 30: default is 3600 = 1 hour. Once results look promising, increase run time to get more models with highly tuned parameters
  • nfolds = 5: 10 folds more standard but trying to reduce model runtimes for this experiment
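The AutoML setup described above can be sketched as follows (object names like train_tbl and automl_models_h2o are assumptions; the target column must be a factor for H2O to treat this as classification):

```r
library(h2o)
h2o.init()

# push the data to H2O and split 85% training / 15% validation
data_h2o  <- as.h2o(train_tbl)
splits    <- h2o.splitFrame(data_h2o, ratios = 0.85, seed = 1234)
train_h2o <- splits[[1]]
valid_h2o <- splits[[2]]

y <- "Attrition"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
    x                = x,
    y                = y,
    training_frame   = train_h2o,
    validation_frame = valid_h2o,
    # leaderboard_frame left NULL - rankings use cross-validation metrics
    max_runtime_secs = 30,
    nfolds           = 5
)
```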

8.1.1 Inspecting The Leaderboard

There is much data produced by the individual models - the leader is shown below:

## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model ID:  GLM_grid_0_AutoML_20181028_212737_model_0 
## GLM Model: summary
##     family  link            regularization
## 1 binomial logit Ridge ( lambda = 0.0042 )
##                                                                  lambda_search
## 1 nlambda = 30, lambda.max = 6.2463, lambda.min = 0.0042, lambda.1se = 0.02055
##   number_of_predictors_total number_of_active_predictors
## 1                         79                          79
##   number_of_iterations                    training_frame
## 1                   54 _804567d077a2a0701a19fb84ffc740ad
## 
## Coefficients: glm coefficients
##                               names coefficients standardized_coefficients
## 1                         Intercept     0.849792                 -1.417281
## 2 JobRole.Healthcare Representative     0.066056                  0.066056
## 3           JobRole.Human Resources     0.167535                  0.167535
## 4     JobRole.Laboratory Technician     0.294450                  0.294450
## 5                   JobRole.Manager    -0.188477                 -0.188477
## 
## ---
##                      names coefficients standardized_coefficients
## 75       TotalWorkingYears    -0.042385                 -0.330973
## 76   TrainingTimesLastYear    -0.180422                 -0.231657
## 77          YearsAtCompany     0.102775                  0.649596
## 78      YearsInCurrentRole    -0.107887                 -0.399926
## 79 YearsSinceLastPromotion     0.098246                  0.321247
## 80    YearsWithCurrManager    -0.118166                 -0.425507
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.077
## RMSE:  0.28
## LogLoss:  0.27
## Mean Per-Class Error:  0.19
## AUC:  0.9
## Gini:  0.8
## R^2:  0.44
## Residual Deviance:  573
## AIC:  733
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     838  51 0.057368    =51/889
## Yes     58 118 0.329545    =58/176
## Totals 896 169 0.102347  =109/1065
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.348528 0.684058 128
## 2                       max f2  0.173866 0.739300 220
## 3                 max f0point5  0.517226 0.767857  79
## 4                 max accuracy  0.517226 0.906103  79
## 5                max precision  0.962484 1.000000   0
## 6                   max recall  0.006173 1.000000 388
## 7              max specificity  0.962484 1.000000   0
## 8             max absolute_mcc  0.348528 0.623206 128
## 9   max min_per_class_accuracy  0.189074 0.825647 210
## 10 max mean_per_class_accuracy  0.218153 0.836509 190
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  0.086
## RMSE:  0.29
## LogLoss:  0.3
## Mean Per-Class Error:  0.24
## AUC:  0.81
## Gini:  0.62
## R^2:  0.26
## Residual Deviance:  112
## AIC:  272
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     148  12 0.075000  =12/160
## Yes     10  15 0.400000   =10/25
## Totals 158  27 0.118919  =22/185
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.314235 0.576923  26
## 2                       max f2  0.167744 0.620915  52
## 3                 max f0point5  0.648042 0.612245   5
## 4                 max accuracy  0.648042 0.897297   5
## 5                max precision  0.933029 1.000000   0
## 6                   max recall  0.010212 1.000000 158
## 7              max specificity  0.933029 1.000000   0
## 8             max absolute_mcc  0.314235 0.508369  26
## 9   max min_per_class_accuracy  0.167744 0.760000  52
## 10 max mean_per_class_accuracy  0.167744 0.773750  52
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.094
## RMSE:  0.31
## LogLoss:  0.32
## Mean Per-Class Error:  0.24
## AUC:  0.85
## Gini:  0.69
## R^2:  0.32
## Residual Deviance:  684
## AIC:  844
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     799  90 0.101237    =90/889
## Yes     65 111 0.369318    =65/176
## Totals 864 201 0.145540  =155/1065
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.303861 0.588859 155
## 2                       max f2  0.192745 0.659068 208
## 3                 max f0point5  0.481045 0.651408  86
## 4                 max accuracy  0.481045 0.881690  86
## 5                max precision  0.949434 1.000000   0
## 6                   max recall  0.002279 1.000000 396
## 7              max specificity  0.949434 1.000000   0
## 8             max absolute_mcc  0.359402 0.511298 127
## 9   max min_per_class_accuracy  0.173231 0.772727 219
## 10 max mean_per_class_accuracy  0.192745 0.781103 208
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid
## accuracy  0.87323946  0.02347418 0.84976524 0.82159626 0.90140843
## auc        0.8425771 0.026760576  0.8296127  0.8300225  0.9127907
## err       0.12676056  0.02347418 0.15023474 0.17840375 0.09859155
## err_count       27.0         5.0       32.0       38.0       21.0
## f0point5  0.63031274  0.07437051  0.6377551 0.45064378  0.7763975
##            cv_4_valid cv_5_valid
## accuracy   0.91079813  0.8826291
## auc         0.7989362 0.84152335
## err       0.089201875 0.11737089
## err_count        19.0       25.0
## f0point5    0.6179775  0.6687898
## 
## ---
##                         mean          sd cv_1_valid cv_2_valid cv_3_valid
## precision          0.6580986 0.096957624 0.65789473  0.4117647  0.8333333
## r2                0.30837095 0.061606336  0.2931284 0.25788376  0.4637525
## recall             0.5819287  0.06444905  0.5681818  0.7241379  0.6097561
## residual_deviance  135.58003   11.968308  165.20996  127.08945  117.54124
## rmse              0.30468094 0.014051158 0.34037712  0.2954366 0.28870824
## specificity        0.9306463 0.035521794  0.9230769  0.8369565  0.9709302
##                   cv_4_valid cv_5_valid
## precision             0.6875        0.7
## r2                0.20434217 0.32274798
## recall                  0.44  0.5675676
## residual_deviance   125.3069  142.75262
## rmse              0.28709954 0.31178322
## specificity        0.9734042  0.9488636

H2O even provides a leaderboard - how cool is that?

##                                                model_id  auc logloss
## 1             GLM_grid_0_AutoML_20181028_212737_model_0 0.85    0.32
## 2 StackedEnsemble_BestOfFamily_0_AutoML_20181028_212737 0.84    0.32
## 3    StackedEnsemble_AllModels_0_AutoML_20181028_212737 0.84    0.32
## 4             GBM_grid_0_AutoML_20181028_212737_model_4 0.82    0.34
## 5             GBM_grid_0_AutoML_20181028_212737_model_5 0.80    0.36
## 6             GBM_grid_0_AutoML_20181028_212737_model_1 0.80    0.36
##   mean_per_class_error rmse   mse
## 1                 0.24 0.31 0.094
## 2                 0.23 0.31 0.093
## 3                 0.24 0.30 0.093
## 4                 0.25 0.32 0.103
## 5                 0.28 0.33 0.109
## 6                 0.27 0.33 0.107
## 
## [21 rows x 6 columns]

8.1.2 Extracting Models From The Leaderboard

Getting the leader is pretty simple:
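Both the leaderboard and the leader live as slots on the AutoML object (automl_models_h2o is the assumed name of the h2o.automl() result):

```r
automl_models_h2o@leaderboard   # ranked H2OFrame of all models
automl_models_h2o@leader        # the top-ranked model object
```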

## Model Details:
## ==============
## 
## H2OBinomialModel: glm
## Model ID:  GLM_grid_0_AutoML_20181028_212737_model_0 
## GLM Model: summary
##     family  link            regularization
## 1 binomial logit Ridge ( lambda = 0.0042 )
##                                                                  lambda_search
## 1 nlambda = 30, lambda.max = 6.2463, lambda.min = 0.0042, lambda.1se = 0.02055
##   number_of_predictors_total number_of_active_predictors
## 1                         79                          79
##   number_of_iterations                    training_frame
## 1                   54 _804567d077a2a0701a19fb84ffc740ad
## 
## Coefficients: glm coefficients
##                               names coefficients standardized_coefficients
## 1                         Intercept     0.849792                 -1.417281
## 2 JobRole.Healthcare Representative     0.066056                  0.066056
## 3           JobRole.Human Resources     0.167535                  0.167535
## 4     JobRole.Laboratory Technician     0.294450                  0.294450
## 5                   JobRole.Manager    -0.188477                 -0.188477
## 
## ---
##                      names coefficients standardized_coefficients
## 75       TotalWorkingYears    -0.042385                 -0.330973
## 76   TrainingTimesLastYear    -0.180422                 -0.231657
## 77          YearsAtCompany     0.102775                  0.649596
## 78      YearsInCurrentRole    -0.107887                 -0.399926
## 79 YearsSinceLastPromotion     0.098246                  0.321247
## 80    YearsWithCurrManager    -0.118166                 -0.425507
## 
## H2OBinomialMetrics: glm
## ** Reported on training data. **
## 
## MSE:  0.077
## RMSE:  0.28
## LogLoss:  0.27
## Mean Per-Class Error:  0.19
## AUC:  0.9
## Gini:  0.8
## R^2:  0.44
## Residual Deviance:  573
## AIC:  733
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     838  51 0.057368    =51/889
## Yes     58 118 0.329545    =58/176
## Totals 896 169 0.102347  =109/1065
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.348528 0.684058 128
## 2                       max f2  0.173866 0.739300 220
## 3                 max f0point5  0.517226 0.767857  79
## 4                 max accuracy  0.517226 0.906103  79
## 5                max precision  0.962484 1.000000   0
## 6                   max recall  0.006173 1.000000 388
## 7              max specificity  0.962484 1.000000   0
## 8             max absolute_mcc  0.348528 0.623206 128
## 9   max min_per_class_accuracy  0.189074 0.825647 210
## 10 max mean_per_class_accuracy  0.218153 0.836509 190
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on validation data. **
## 
## MSE:  0.086
## RMSE:  0.29
## LogLoss:  0.3
## Mean Per-Class Error:  0.24
## AUC:  0.81
## Gini:  0.62
## R^2:  0.26
## Residual Deviance:  112
## AIC:  272
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     148  12 0.075000  =12/160
## Yes     10  15 0.400000   =10/25
## Totals 158  27 0.118919  =22/185
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.314235 0.576923  26
## 2                       max f2  0.167744 0.620915  52
## 3                 max f0point5  0.648042 0.612245   5
## 4                 max accuracy  0.648042 0.897297   5
## 5                max precision  0.933029 1.000000   0
## 6                   max recall  0.010212 1.000000 158
## 7              max specificity  0.933029 1.000000   0
## 8             max absolute_mcc  0.314235 0.508369  26
## 9   max min_per_class_accuracy  0.167744 0.760000  52
## 10 max mean_per_class_accuracy  0.167744 0.773750  52
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: glm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.094
## RMSE:  0.31
## LogLoss:  0.32
## Mean Per-Class Error:  0.24
## AUC:  0.85
## Gini:  0.69
## R^2:  0.32
## Residual Deviance:  684
## AIC:  844
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     799  90 0.101237    =90/889
## Yes     65 111 0.369318    =65/176
## Totals 864 201 0.145540  =155/1065
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.303861 0.588859 155
## 2                       max f2  0.192745 0.659068 208
## 3                 max f0point5  0.481045 0.651408  86
## 4                 max accuracy  0.481045 0.881690  86
## 5                max precision  0.949434 1.000000   0
## 6                   max recall  0.002279 1.000000 396
## 7              max specificity  0.949434 1.000000   0
## 8             max absolute_mcc  0.359402 0.511298 127
## 9   max min_per_class_accuracy  0.173231 0.772727 219
## 10 max mean_per_class_accuracy  0.192745 0.781103 208
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid
## accuracy  0.87323946  0.02347418 0.84976524 0.82159626 0.90140843
## auc        0.8425771 0.026760576  0.8296127  0.8300225  0.9127907
## err       0.12676056  0.02347418 0.15023474 0.17840375 0.09859155
## err_count       27.0         5.0       32.0       38.0       21.0
## f0point5  0.63031274  0.07437051  0.6377551 0.45064378  0.7763975
##            cv_4_valid cv_5_valid
## accuracy   0.91079813  0.8826291
## auc         0.7989362 0.84152335
## err       0.089201875 0.11737089
## err_count        19.0       25.0
## f0point5    0.6179775  0.6687898
## 
## ---
##                         mean          sd cv_1_valid cv_2_valid cv_3_valid
## precision          0.6580986 0.096957624 0.65789473  0.4117647  0.8333333
## r2                0.30837095 0.061606336  0.2931284 0.25788376  0.4637525
## recall             0.5819287  0.06444905  0.5681818  0.7241379  0.6097561
## residual_deviance  135.58003   11.968308  165.20996  127.08945  117.54124
## rmse              0.30468094 0.014051158 0.34037712  0.2954366 0.28870824
## specificity        0.9306463 0.035521794  0.9230769  0.8369565  0.9709302
##                   cv_4_valid cv_5_valid
## precision             0.6875        0.7
## r2                0.20434217 0.32274798
## recall                  0.44  0.5675676
## residual_deviance   125.3069  142.75262
## rmse              0.28709954 0.31178322
## specificity        0.9734042  0.9488636

To evaluate a model other than the leader, you must use the model_id in the leaderboard:

The code above is commented out only because, each time the RMD is knitted, different models are built and the model_id values change.
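As a sketch (assuming a live H2O cluster; the model_id string is illustrative and changes on every AutoML run, which is why this style of code is left commented out when knitting):

```r
library(h2o)

# Copy a model_id from the printed leaderboard and pass it to h2o.getModel().
# This ID is illustrative only - it will differ on your run.
model_4 <- h2o.getModel("GBM_grid_0_AutoML_20181028_212737_model_4")
model_4@model_id
```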

8.1.3 Custom Function: extract_h2o_model_name_by_position
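The function itself is not echoed above; a minimal sketch of what it might look like (the argument names and the verbose flag are assumptions):

```r
library(h2o)
library(dplyr)

# Sketch: return the model name found at position n of an H2O leaderboard,
# optionally printing it, so the result can be piped into h2o.getModel()
extract_h2o_model_name_by_position <- function(h2o_leaderboard, n = 1,
                                               verbose = TRUE) {
    model_name <- h2o_leaderboard %>%
        as_tibble() %>%
        slice(n) %>%
        pull(model_id)
    if (verbose) message(model_name)
    model_name
}

# Usage (assumed object name): extract the 3rd model on the leaderboard
# automl_models_h2o@leaderboard %>%
#     extract_h2o_model_name_by_position(3) %>%
#     h2o.getModel()
```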

## StackedEnsemble_AllModels_0_AutoML_20181028_212737
## Model Details:
## ==============
## 
## H2OBinomialModel: stackedensemble
## Model ID:  StackedEnsemble_AllModels_0_AutoML_20181028_212737 
## NULL
## 
## 
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
## 
## MSE:  0.051
## RMSE:  0.23
## LogLoss:  0.19
## Mean Per-Class Error:  0.13
## AUC:  0.96
## Gini:  0.93
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error      Rate
## No     863  26 0.029246   =26/889
## Yes     39 137 0.221591   =39/176
## Totals 902 163 0.061033  =65/1065
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.372597 0.808260 140
## 2                       max f2  0.230229 0.826039 179
## 3                 max f0point5  0.621124 0.860927  90
## 4                 max accuracy  0.419274 0.938967 132
## 5                max precision  0.983435 1.000000   0
## 6                   max recall  0.058457 1.000000 336
## 7              max specificity  0.983435 1.000000   0
## 8             max absolute_mcc  0.372597 0.772834 140
## 9   max min_per_class_accuracy  0.168999 0.889764 210
## 10 max mean_per_class_accuracy  0.230229 0.895794 179
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on validation data. **
## 
## MSE:  0.083
## RMSE:  0.29
## LogLoss:  0.29
## Mean Per-Class Error:  0.21
## AUC:  0.82
## Gini:  0.65
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     143  17 0.106250  =17/160
## Yes      8  17 0.320000    =8/25
## Totals 151  34 0.135135  =25/185
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.225906 0.576271  33
## 2                       max f2  0.092863 0.644172  62
## 3                 max f0point5  0.850666 0.612245   5
## 4                 max accuracy  0.850666 0.897297   5
## 5                max precision  0.982857 1.000000   0
## 6                   max recall  0.041520 1.000000 146
## 7              max specificity  0.982857 1.000000   0
## 8             max absolute_mcc  0.225906 0.506436  33
## 9   max min_per_class_accuracy  0.105688 0.760000  55
## 10 max mean_per_class_accuracy  0.092863 0.788750  62
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.093
## RMSE:  0.3
## LogLoss:  0.32
## Mean Per-Class Error:  0.24
## AUC:  0.84
## Gini:  0.68
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error       Rate
## No     807  82 0.092238    =82/889
## Yes     67 109 0.380682    =67/176
## Totals 874 191 0.139906  =149/1065
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.278936 0.594005 157
## 2                       max f2  0.150185 0.645325 219
## 3                 max f0point5  0.505251 0.667808  88
## 4                 max accuracy  0.505251 0.885446  88
## 5                max precision  0.982169 1.000000   0
## 6                   max recall  0.033742 1.000000 393
## 7              max specificity  0.982169 1.000000   0
## 8             max absolute_mcc  0.437944 0.526086 102
## 9   max min_per_class_accuracy  0.113342 0.755682 253
## 10 max mean_per_class_accuracy  0.150185 0.774744 219
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

8.1.4 Saving & Loading H2O Models

Saving models is simple. First use h2o.getModel(), then save the model by running h2o.saveModel(). Easy! The code below is not evaluated because the models change each time the RMD is knitted.

Loading a saved model is equally simple using h2o.loadModel():

getModel returns a reference to an existing model in the H2O instance. loadModel loads a saved H2O model from disk.
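A sketch of the save/load round trip (the path, model ID, and object names are assumptions; this requires a running H2O cluster):

```r
library(h2o)

# Save: h2o.saveModel() writes the model to disk and returns the full
# path of the saved model, which is handy for reloading later
model <- h2o.getModel("GLM_grid_0_AutoML_20181028_212737_model_0")
path  <- h2o.saveModel(object = model, path = "04_Modeling/h2o_models/", force = TRUE)

# Load: read the saved model from disk back into the H2O cluster
model_reloaded <- h2o.loadModel(path)
```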

8.1.5 Making Predictions

Load a model:

Make predictions with the model using test_tbl and h2o.predict - again, simple!

## # A tibble: 6 x 3
##   predict    No      Yes
##   <fct>   <dbl>    <dbl>
## 1 No      0.981 0.0195  
## 2 No      0.803 0.197   
## 3 No      0.999 0.000884
## 4 No      0.974 0.0260  
## 5 Yes     0.338 0.662   
## 6 No      0.923 0.0766

For binary classification, H2O provides 3 columns of data:

  1. Class prediction
  2. 1st class probability
  3. 2nd class probability
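A sketch of the prediction step (assuming the loaded model is named automl_leader and test_tbl is the test set, as elsewhere in this document):

```r
library(h2o)
library(dplyr)

# h2o.predict() needs an H2OFrame, so convert the tibble first; the result
# converts back to a tibble with predict, No, and Yes columns
predictions_tbl <- h2o.predict(automl_leader, newdata = as.h2o(test_tbl)) %>%
    as_tibble()

predictions_tbl %>% head()
```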

8.2 Advanced Concepts

8.2.1 Train, Validation, & Leaderboard Frames

The models are trained on the training data and then scored against the validation data set; the algorithm learns from the differences.

Repeated Note: The leaderboard_frame is a legacy argument that is not necessary; the data is better used for training and/or validation. Therefore, you should perform only one split (training and validation) rather than two (training, validation, and a test set used for the leaderboard_frame).
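A sketch of the single-split setup (object names assumed from earlier chapters):

```r
library(h2o)

# One split only: 85% training, 15% validation.
# No third split is made for a leaderboard_frame.
split_h2o <- h2o.splitFrame(as.h2o(train_readable_tbl), ratios = 0.85, seed = 1234)
train_h2o <- split_h2o[[1]]
valid_h2o <- split_h2o[[2]]
```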

8.2.2 H2O AutoML Model Parameters

When extracting the model parameters, there is sufficient information to recreate the model outside of the H2O AutoML framework. The AutoML leaderboard returns many models; each one can be extracted and its contents opened up to determine the exact model parameters. The only one that is difficult to recreate is the stacked ensemble, since it is a combination of models. Others, such as GLM, GBM, and Deep Learning, can be inspected and recreated (see the Bonus Lecture on Grid Search).

8.2.2.1 Deep Learning Parameters

Use the @parameters option with the model:

## $model_id
## [1] "GLM_grid_0_AutoML_20181028_212737_model_0"
## 
## $training_frame
## [1] "_804567d077a2a0701a19fb84ffc740ad"
## 
## $validation_frame
## [1] "RTMP_sid_bdf4_5"
## 
## $nfolds
## [1] 5
## 
## $seed
## [1] -5728624559962048512
## 
## $keep_cross_validation_predictions
## [1] TRUE
## 
## $fold_assignment
## [1] "Modulo"
## 
## $family
## [1] "binomial"
## 
## $solver
## [1] "COORDINATE_DESCENT"
## 
## $alpha
## [1] 0.0 0.2 0.4 0.6 0.8 1.0
## 
## $lambda
##  [1] 6.2463 4.5467 3.3095 2.4090 1.7535 1.2763 0.9290 0.6762 0.4922 0.3583
## [11] 0.2608 0.1898 0.1382 0.1006 0.0732 0.0533 0.0388 0.0282 0.0206 0.0150
## [21] 0.0109 0.0079 0.0058 0.0042 0.0031 0.0022 0.0016
## 
## $lambda_search
## [1] TRUE
## 
## $nlambdas
## [1] 30
## 
## $max_iterations
## [1] 300
## 
## $objective_epsilon
## [1] 0.0001
## 
## $gradient_epsilon
## [1] 0.000001
## 
## $link
## [1] "logit"
## 
## $lambda_min_ratio
## [1] 0.0001
## 
## $max_active_predictors
## [1] 5000
## 
## $obj_reg
## [1] 0.00094
## 
## $max_runtime_secs
## [1] 4.3
## 
## $x
##  [1] "JobRole"                  "EducationField"          
##  [3] "JobLevel"                 "Education"               
##  [5] "EnvironmentSatisfaction"  "JobInvolvement"          
##  [7] "JobSatisfaction"          "RelationshipSatisfaction"
##  [9] "StockOptionLevel"         "WorkLifeBalance"         
## [11] "Department"               "BusinessTravel"          
## [13] "MaritalStatus"            "Gender"                  
## [15] "OverTime"                 "PerformanceRating"       
## [17] "Age"                      "DailyRate"               
## [19] "DistanceFromHome"         "EmployeeNumber"          
## [21] "HourlyRate"               "MonthlyIncome"           
## [23] "MonthlyRate"              "NumCompaniesWorked"      
## [25] "PercentSalaryHike"        "TotalWorkingYears"       
## [27] "TrainingTimesLastYear"    "YearsAtCompany"          
## [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [31] "YearsWithCurrManager"    
## 
## $y
## [1] "Attrition"

You can then select a parameter of interest:

## [1] 5

H2O provides complete access to the model parameters so you can recreate the model or tweak model parameters easily.
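Since @parameters is a named list, individual parameters can be pulled with standard list subsetting (the leader object name is an assumption):

```r
# Pull single parameters from the @parameters list
automl_leader@parameters$nfolds   # number of CV folds
automl_leader@parameters$family   # "binomial" for this GLM
automl_leader@parameters$lambda   # regularization path that was searched
```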

8.2.3 Cross Validation (K-Fold CV)

H2O Documentation

nfolds sets up K-fold cross validation. K-fold cross validation is used in model development to determine which parameters affect prediction performance by performing multiple sub-tests rather than a single test. The end result is parameter selection based on model performance stability, which generalizes better to new data.

For example, for nfolds=5, 6 models are built. The first 5 models (cross-validation models) are built on 80% of the training data, and a different 20% is held out for each of the 5 models. Then the main model is built on 100% of the training data. This main model is the model you get back from H2O in R, Python and Flow (though the CV models are also stored and available to access later).

All 5 cross-validation models contain training metrics (from the 80% training data) and validation metrics (from their 20% holdout/validation data). To compute their individual validation metrics, each of the 5 cross-validation models had to make predictions on its 20% of the rows of the original training frame and score them against the true labels of the 20% holdout.

For the main model, this is how the cross-validation metrics are computed: The 5 holdout predictions are combined into one prediction for the full training dataset (i.e., predictions for every row of the training data, but the model making the prediction for a particular row has not seen that row during training). This “holdout prediction” is then scored against the true labels, and the overall cross-validation metrics are computed.

Each CV model produces a prediction frame pertaining to its fold. These frames can be saved and probed from the various clients if the keep_cross_validation_predictions parameter is set in the model constructor.

These holdout predictions have some interesting properties. First they have names like:

prediction_GBM_model_1452035702801_1_cv_1

and they contain, unsurprisingly, predictions for the data held out in the fold. They also have the same number of rows as the entire input training frame with 0s filled in for all rows that are not in the hold out.
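When keep_cross_validation_predictions = TRUE was set, the h2o package exposes these frames directly (the model object name is an assumption):

```r
library(h2o)

# One prediction frame per fold, padded with 0s outside the fold
cv_preds <- h2o.cross_validation_predictions(automl_leader)

# The combined holdout predictions: one out-of-fold prediction per
# row of the original training frame
holdout_preds <- h2o.cross_validation_holdout_predictions(automl_leader)
```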

Retrieve the AUC for a classifier with h2o.auc(); the xval argument returns the average cross-validation AUC:

## train valid  xval 
##  0.90  0.81  0.85
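The output above has the shape produced by a call like the following (object name assumed):

```r
library(h2o)

# Report the stored AUC for training, validation, and cross-validation
h2o.auc(automl_leader, train = TRUE, valid = TRUE, xval = TRUE)
```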

8.2.4 Grid/Hyperparameter Search

H2O Documentation

H2O supports two types of grid search – traditional (or “cartesian”) grid search and random grid search. In a cartesian grid search, users specify a set of values for each hyperparameter that they want to search over, and H2O will train a model for every combination of the hyperparameter values. This means that if you have three hyperparameters and you specify 5, 10 and 2 values for each, your grid will contain a total of 5 * 10 * 2 = 100 models.

In random grid search, the user specifies the hyperparameter space in the exact same way, except H2O will sample uniformly from the set of all possible hyperparameter value combinations. In random grid search, the user also specifies a stopping criterion, which controls when the random grid search is completed. The user can tell the random grid search to stop by specifying a maximum number of models or the maximum number of seconds allowed for the search. The user may also specify a performance-metric-based stopping criterion, which will stop the random grid search when the performance stops improving by a specified amount.

Once the grid search is complete, the user can query the grid object and sort the models by a particular performance metric (for example, “AUC”). All models are stored in the H2O cluster and are accessible by model id.

Grid Search

Note the use of h2o.grid and hyper_params

Random Grid Search

Note the use of seq, hyper_params, and search_criteria: search_criteria <- list(strategy = "RandomDiscrete", max_models = 36, seed = 1)

Specify the RandomDiscrete strategy to perform a random search over all the combinations of your hyperparameters. RandomDiscrete should be combined with at least one early stopping criterion, such as max_models and/or max_runtime_secs.

Examples:
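The example code is not echoed above; a sketch of a random grid search over a deep learning model (the hyperparameter names and values are illustrative, not the course's exact grid; x, y, train_h2o, and valid_h2o are assumed from earlier chapters):

```r
library(h2o)

# Hyperparameter space to sample from
hyper_params <- list(
    hidden = list(c(10, 10, 10), c(50, 20, 10), c(20, 20, 20)),
    epochs = c(10, 50, 100)
)

# Stop after 36 sampled models; the seed makes the sampling reproducible
search_criteria <- list(strategy = "RandomDiscrete", max_models = 36, seed = 1)

deeplearning_grid <- h2o.grid(
    algorithm        = "deeplearning",
    grid_id          = "deeplearning_grid_01",
    x                = x,
    y                = y,
    training_frame   = train_h2o,
    validation_frame = valid_h2o,
    nfolds           = 5,
    hyper_params     = hyper_params,
    search_criteria  = search_criteria
)

# Sort the completed grid by a performance metric
h2o.getGrid("deeplearning_grid_01", sort_by = "auc", decreasing = TRUE)
```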

To learn everything about this topic, visit this tutorial.

The H2O deep learning model sets nfolds = 0 by default!

8.3 Visualizing The Leaderboard

Develop code to improve the output of H2O leaderboard:

##                                                model_id  auc logloss
## 1             GLM_grid_0_AutoML_20181028_212737_model_0 0.85    0.32
## 2 StackedEnsemble_BestOfFamily_0_AutoML_20181028_212737 0.84    0.32
## 3    StackedEnsemble_AllModels_0_AutoML_20181028_212737 0.84    0.32
## 4             GBM_grid_0_AutoML_20181028_212737_model_4 0.82    0.34
## 5             GBM_grid_0_AutoML_20181028_212737_model_5 0.80    0.36
## 6             GBM_grid_0_AutoML_20181028_212737_model_1 0.80    0.36
##   mean_per_class_error rmse   mse
## 1                 0.24 0.31 0.094
## 2                 0.23 0.31 0.093
## 3                 0.24 0.30 0.093
## 4                 0.25 0.32 0.103
## 5                 0.28 0.33 0.109
## 6                 0.27 0.33 0.107
## 
## [21 rows x 6 columns]

AUC measures the performance of a binary classifier by comparing the false positive rate to the true positive rate. Logloss (logarithmic loss) measures the performance of a classifier by comparing the predicted class probability to the actual value (1 or 0).

The AUC and Logloss metrics may suggest different models.

8.3.1 Custom Function: plot_h2o_leaderboard

  • str_split with simplify = TRUE makes the output a matrix rather than a list, which makes it easier to extract data with [, 1]. This allows us to get the model type from each model ID in the leaderboard.
  • rownames_to_column adds the row names of a data frame as a character column. Here it is very helpful because it identifies the order in which the models appear in the leaderboard.
  • Note the use of as_factor versus as.factor. as_factor keeps the order of appearance, allowing us to order the data by AUC; recall that as.factor would order alphabetically.
  • Use gather to take advantage of ggplot's color and group aesthetics. The data must be in long format: the faceting values are stacked on top of each other, making the data frame long rather than wide. gather transposes auc and logloss from two columns into one - the key column holds the names auc and logloss, and the value column holds their associated values. - Before:
## # A tibble: 10 x 8
##    rowname model_id   auc logloss mean_per_class_~  rmse    mse model_type
##    <chr>   <fct>    <dbl>   <dbl>            <dbl> <dbl>  <dbl> <fct>     
##  1 1       GLM_gri~ 0.846   0.321            0.235 0.307 0.0940 GLM       
##  2 2       Stacked~ 0.840   0.316            0.232 0.306 0.0934 StackedEn~
##  3 3       Stacked~ 0.839   0.316            0.236 0.305 0.0929 StackedEn~
##  4 4       GBM_gri~ 0.816   0.345            0.254 0.320 0.103  GBM       
##  5 5       GBM_gri~ 0.800   0.359            0.285 0.330 0.109  GBM       
##  6 6       GBM_gri~ 0.797   0.362            0.273 0.327 0.107  GBM       
##  7 7       GBM_gri~ 0.795   0.351            0.259 0.322 0.104  GBM       
##  8 8       GBM_gri~ 0.783   0.372            0.293 0.332 0.110  GBM       
##  9 9       GBM_gri~ 0.783   0.379            0.288 0.337 0.113  GBM       
## 10 10      GBM_gri~ 0.781   0.405            0.297 0.353 0.125  GBM
  • After:
## # A tibble: 50 x 5
##    rowname model_id                              model_type    key   value
##    <chr>   <fct>                                 <fct>         <fct> <dbl>
##  1 1       GLM_grid_0_AutoML_20181028_212737_mo~ GLM           auc   0.846
##  2 2       StackedEnsemble_BestOfFamily_0_AutoM~ StackedEnsem~ auc   0.840
##  3 3       StackedEnsemble_AllModels_0_AutoML_2~ StackedEnsem~ auc   0.839
##  4 4       GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.816
##  5 5       GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.800
##  6 6       GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.797
##  7 7       GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.795
##  8 8       GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.783
##  9 9       GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.783
## 10 10      GBM_grid_0_AutoML_20181028_212737_mo~ GBM           auc   0.781
## # ... with 40 more rows
  • fct_rev() reverses the factor levels.
    • Without fct_rev:

  • With fct_rev:

  • scales in the ggplot: scales can be fixed (“fixed”, the default), free (“free”), or free in one dimension (“free_x”, “free_y”)

The course example used a deep learning model where only auc and logloss were available metrics. Other models may return those plus mean_per_class_error, rmse, and mse.

# Visualize the H2O leaderboard to help with model selection
plot_h2o_leaderboard <- function(h2o_leaderboard, order_by = c("auc", "logloss"),
                                 n_max = 20, size = 4, include_lbl = TRUE) {

    # Setup inputs (as.tibble() is deprecated; use as_tibble())
    order_by <- tolower(order_by[[1]])

    leaderboard_tbl <- h2o_leaderboard %>%
        as_tibble() %>%
        mutate(model_type = str_split(model_id, "_", simplify = TRUE) %>% .[, 1]) %>%
        rownames_to_column(var = "rowname") %>%
        mutate(model_id = paste0(rowname, ". ", as.character(model_id)) %>% as.factor())

    # Transformation: order models by the requested metric, then gather to long format
    if (order_by == "auc") {

        data_transformed_tbl <- leaderboard_tbl %>%
            slice(1:n_max) %>%
            mutate(
                model_id   = as_factor(model_id) %>% reorder(auc),
                model_type = as.factor(model_type)
            ) %>%
            gather(key = key, value = value,
                   -c(model_id, model_type, rowname), factor_key = TRUE)

    } else if (order_by == "logloss") {

        data_transformed_tbl <- leaderboard_tbl %>%
            slice(1:n_max) %>%
            mutate(
                model_id   = as_factor(model_id) %>% reorder(logloss) %>% fct_rev(),
                model_type = as.factor(model_type)
            ) %>%
            gather(key = key, value = value,
                   -c(model_id, model_type, rowname), factor_key = TRUE)

    } else {
        stop(paste0("order_by = '", order_by, "' is not a permitted option."))
    }

    # Visualization: one point per model per metric, faceted by metric
    g <- data_transformed_tbl %>%
        ggplot(aes(value, model_id, color = model_type)) +
        geom_point(size = size) +
        facet_wrap(~ key, scales = "free_x") +
        theme_tq() +
        scale_color_tq() +
        labs(title = "Leaderboard Metrics",
             subtitle = paste0("Ordered by: ", toupper(order_by)),
             y = "Model Position, Model ID", x = "")

    if (include_lbl) g <- g + geom_label(aes(label = round(value, 2), hjust = "inward"))

    g
}
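Usage might look like this (the AutoML object name is an assumption):

```r
# Plot the top 10 leaderboard models, ordered by logloss
automl_models_h2o@leaderboard %>%
    plot_h2o_leaderboard(order_by = "logloss", n_max = 10)
```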

The number of facets changes with the model type. Deep Learning models might provide AUC and Logloss. Other models may return rmse, mse, etc.

9 Modeling Churn: Assessing H2O Performance

This chapter covers AUC, Logloss, precision-recall plots, ROC plots, and gain and lift charts.

A consolidated plot will also be created - one that is suitable for executive presentation.

9.1 H2O Performance For Binary Classification

Create an H2O performance object. We could use h2o.loadModel(), but it is simpler to reuse the current leader. Let's also get a deep learning model.

h2o.performance() creates an H2O performance object from a model and new data.
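A sketch of the call that produces output of this shape (object names assumed from earlier chapters):

```r
library(h2o)

# Score the AutoML leader against the held-out test set
performance_h2o <- h2o.performance(automl_leader, newdata = as.h2o(test_tbl))
performance_h2o
```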

## H2OBinomialMetrics: glm
## 
## MSE:  0.089
## RMSE:  0.3
## LogLoss:  0.3
## Mean Per-Class Error:  0.22
## AUC:  0.87
## Gini:  0.74
## R^2:  0.35
## Residual Deviance:  131
## AIC:  291
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     171  13 0.070652  =13/184
## Yes     13  23 0.361111   =13/36
## Totals 184  36 0.118182  =26/220
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.361471 0.638889  35
## 2                       max f2  0.305865 0.680628  46
## 3                 max f0point5  0.556700 0.714286  18
## 4                 max accuracy  0.556700 0.895455  18
## 5                max precision  0.907863 1.000000   0
## 6                   max recall  0.040203 1.000000 145
## 7              max specificity  0.907863 1.000000   0
## 8             max absolute_mcc  0.361471 0.568237  35
## 9   max min_per_class_accuracy  0.196433 0.777174  68
## 10 max mean_per_class_accuracy  0.305865 0.804046  46
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

The performance object is verbose!

## $model
## $model$`__meta`
## $model$`__meta`$schema_version
## [1] 3
## 
## $model$`__meta`$schema_name
## [1] "ModelKeyV3"
## 
## $model$`__meta`$schema_type
## [1] "Key<Model>"
## 
## 
## $model$name
## [1] "GLM_grid_0_AutoML_20181028_212737_model_0"
## 
## $model$type
## [1] "Key<Model>"
## 
## $model$URL
## [1] "/3/Models/GLM_grid_0_AutoML_20181028_212737_model_0"
## 
## 
## $model_checksum
## [1] 1662347448424290560
## 
## $frame
## $frame$name
## [1] "test_tbl"
## 
## 
## $frame_checksum
## [1] -54403433091472432
## 
## $description
## NULL
## 
## $scoring_time
## [1] 1540776501447
## 
## $predictions
## NULL
## 
## $MSE
## [1] 0.089
## 
## $RMSE
## [1] 0.3
## 
## $nobs
## [1] 220
## 
## $custom_metric_name
## NULL
## 
## $custom_metric_value
## [1] 0
## 
## $r2
## [1] 0.35
## 
## $logloss
## [1] 0.3
## 
## $AUC
## [1] 0.87
## 
## $pr_auc
## [1] 0.66
## 
## $Gini
## [1] 0.74
## 
## $mean_per_class_error
## [1] 0.22
## 
## $domain
## [1] "No"  "Yes"
## 
## $cm
## $cm$`__meta`
## $cm$`__meta`$schema_version
## [1] 3
## 
## $cm$`__meta`$schema_name
## [1] "ConfusionMatrixV3"
## 
## $cm$`__meta`$schema_type
## [1] "ConfusionMatrix"
## 
## 
## $cm$table
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##         No Yes  Error       Rate
## No     171  13 0.0707 = 13 / 184
## Yes     13  23 0.3611 =  13 / 36
## Totals 184  36 0.1182 = 26 / 220
## 
## 
## $thresholds_and_metric_scores
## Metrics for Thresholds: Binomial metrics as a function of classification thresholds
##   threshold       f1       f2 f0point5 accuracy precision   recall
## 1  0.907863 0.054054 0.034483 0.125000 0.840909  1.000000 0.027778
## 2  0.815811 0.105263 0.068493 0.227273 0.845455  1.000000 0.055556
## 3  0.801944 0.153846 0.102041 0.312500 0.850000  1.000000 0.083333
## 4  0.769743 0.200000 0.135135 0.384615 0.854545  1.000000 0.111111
## 5  0.720564 0.243902 0.167785 0.446429 0.859091  1.000000 0.138889
##   specificity absolute_mcc min_per_class_accuracy mean_per_class_accuracy
## 1    1.000000     0.152769               0.027778                0.513889
## 2    1.000000     0.216543               0.055556                0.527778
## 3    1.000000     0.265820               0.083333                0.541667
## 4    1.000000     0.307653               0.111111                0.555556
## 5    1.000000     0.344765               0.138889                0.569444
##   tns fns fps tps      tnr      fnr      fpr      tpr idx
## 1 184  35   0   1 1.000000 0.972222 0.000000 0.027778   0
## 2 184  34   0   2 1.000000 0.944444 0.000000 0.055556   1
## 3 184  33   0   3 1.000000 0.916667 0.000000 0.083333   2
## 4 184  32   0   4 1.000000 0.888889 0.000000 0.111111   3
## 5 184  31   0   5 1.000000 0.861111 0.000000 0.138889   4
## 
## ---
##     threshold       f1       f2 f0point5 accuracy precision   recall
## 215  0.001598 0.286853 0.501393 0.200893 0.186364  0.167442 1.000000
## 216  0.001287 0.285714 0.500000 0.200000 0.181818  0.166667 1.000000
## 217  0.001130 0.284585 0.498615 0.199115 0.177273  0.165899 1.000000
## 218  0.000965 0.283465 0.497238 0.198238 0.172727  0.165138 1.000000
## 219  0.000884 0.282353 0.495868 0.197368 0.168182  0.164384 1.000000
## 220  0.000824 0.281250 0.494505 0.196507 0.163636  0.163636 1.000000
##     specificity absolute_mcc min_per_class_accuracy
## 215    0.027174     0.067454               0.027174
## 216    0.021739     0.060193               0.021739
## 217    0.016304     0.052008               0.016304
## 218    0.010870     0.042367               0.010870
## 219    0.005435     0.029890               0.005435
## 220    0.000000     0.000000               0.000000
##     mean_per_class_accuracy tns fns fps tps      tnr      fnr      fpr
## 215                0.513587   5   0 179  36 0.027174 0.000000 0.972826
## 216                0.510870   4   0 180  36 0.021739 0.000000 0.978261
## 217                0.508152   3   0 181  36 0.016304 0.000000 0.983696
## 218                0.505435   2   0 182  36 0.010870 0.000000 0.989130
## 219                0.502717   1   0 183  36 0.005435 0.000000 0.994565
## 220                0.500000   0   0 184  36 0.000000 0.000000 1.000000
##          tpr idx
## 215 1.000000 214
## 216 1.000000 215
## 217 1.000000 216
## 218 1.000000 217
## 219 1.000000 218
## 220 1.000000 219
## 
## $max_criteria_and_metric_scores
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.361471 0.638889  35
## 2                       max f2  0.305865 0.680628  46
## 3                 max f0point5  0.556700 0.714286  18
## 4                 max accuracy  0.556700 0.895455  18
## 5                max precision  0.907863 1.000000   0
## 6                   max recall  0.040203 1.000000 145
## 7              max specificity  0.907863 1.000000   0
## 8             max absolute_mcc  0.361471 0.568237  35
## 9   max min_per_class_accuracy  0.196433 0.777174  68
## 10 max mean_per_class_accuracy  0.305865 0.804046  46
## 
## $gains_lift_table
## Gains/Lift Table: Avg response rate: 16.36 %, avg score: 17.06 %
##    group cumulative_data_fraction lower_threshold     lift cumulative_lift
## 1      1               0.01363636        0.795825 6.111111        6.111111
## 2      2               0.02272727        0.698406 6.111111        6.111111
## 3      3               0.03181818        0.657941 6.111111        6.111111
## 4      4               0.04090909        0.631589 6.111111        6.111111
## 5      5               0.05000000        0.610115 3.055556        5.555556
## 6      6               0.10000000        0.493094 3.888889        4.722222
## 7      7               0.15000000        0.373599 2.222222        3.888889
## 8      8               0.20000000        0.314707 2.222222        3.472222
## 9      9               0.30000000        0.200597 0.555556        2.500000
## 10    10               0.40000000        0.128708 0.833333        2.083333
## 11    11               0.50000000        0.087737 0.555556        1.777778
## 12    12               0.60000000        0.055181 0.277778        1.527778
## 13    13               0.70000000        0.031811 0.833333        1.428571
## 14    14               0.80000000        0.017333 0.000000        1.250000
## 15    15               0.90000000        0.008458 0.000000        1.111111
## 16    16               1.00000000        0.000824 0.000000        1.000000
##    response_rate    score cumulative_response_rate cumulative_score
## 1       1.000000 0.841873                 1.000000         0.841873
## 2       1.000000 0.745154                 1.000000         0.803185
## 3       1.000000 0.662249                 1.000000         0.762918
## 4       1.000000 0.644891                 1.000000         0.736689
## 5       0.500000 0.625222                 0.909091         0.716423
## 6       0.636364 0.569948                 0.772727         0.643185
## 7       0.363636 0.425287                 0.636364         0.570552
## 8       0.363636 0.343600                 0.568182         0.513814
## 9       0.090909 0.256735                 0.409091         0.428121
## 10      0.136364 0.163885                 0.340909         0.362062
## 11      0.090909 0.103809                 0.290909         0.310412
## 12      0.045455 0.071676                 0.250000         0.270622
## 13      0.136364 0.041599                 0.233766         0.237905
## 14      0.000000 0.024287                 0.204545         0.211202
## 15      0.000000 0.012672                 0.181818         0.189144
## 16      0.000000 0.004134                 0.163636         0.170643
##    capture_rate cumulative_capture_rate        gain cumulative_gain
## 1      0.083333                0.083333  511.111111      511.111111
## 2      0.055556                0.138889  511.111111      511.111111
## 3      0.055556                0.194444  511.111111      511.111111
## 4      0.055556                0.250000  511.111111      511.111111
## 5      0.027778                0.277778  205.555556      455.555556
## 6      0.194444                0.472222  288.888889      372.222222
## 7      0.111111                0.583333  122.222222      288.888889
## 8      0.111111                0.694444  122.222222      247.222222
## 9      0.055556                0.750000  -44.444444      150.000000
## 10     0.083333                0.833333  -16.666667      108.333333
## 11     0.055556                0.888889  -44.444444       77.777778
## 12     0.027778                0.916667  -72.222222       52.777778
## 13     0.083333                1.000000  -16.666667       42.857143
## 14     0.000000                1.000000 -100.000000       25.000000
## 15     0.000000                1.000000 -100.000000       11.111111
## 16     0.000000                1.000000 -100.000000        0.000000
## 
## $residual_deviance
## [1] 131
## 
## $null_deviance
## [1] 196
## 
## $AIC
## [1] 291
## 
## $null_degrees_of_freedom
## [1] 219
## 
## $residual_degrees_of_freedom
## [1] 140

Accessing the information from the performance object is simple!

Some metrics are directly available, like auc: Area Under the Curve, referring to the Receiver Operating Characteristic (ROC) plot that measures the true positive rate (TPR) against the false positive rate (FPR)

## [1] 0.87

h2o.auc() provides options for returning values for train, validation, and cross-validation. However, these options are valid only for models, not performance objects.

This works:

## train valid  xval 
##  0.90  0.81  0.85

This does not return multiple values:

## [1] 0.87
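The two behaviors can be sketched as follows (`model_h2o` and `performance_h2o` are placeholder names for a fitted H2O model and an object returned by h2o.performance()):

```r
# On a model, h2o.auc() can return values for train, valid, and xval:
h2o.auc(model_h2o, train = TRUE, valid = TRUE, xval = TRUE)

# On a performance object, those flags are not applicable and a
# single AUC for that object's data is returned:
h2o.auc(performance_h2o)
```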

9.1.1 Gini Coefficient:

\(AUC = (Gini + 1)/2\) | \(Gini = 2 \times AUC - 1\)

The Gini Coefficient is a popular metric on Kaggle, especially for imbalanced class values.

The Gini coefficient is one of many ways to measure divergence, i.e. the difference between the distributions of two variables in a data set. Values range from -1 to 1, indicating total divergence in one direction or the other, with 0 representing no divergence.

## [1] 0.74
## train valid  xval 
##  0.80  0.62  0.69

9.1.2 Logloss

Logloss is a metric that measures the class probability from the model against the actual value in binary format (0/1). See this for more information.

Log Loss quantifies the accuracy of a classifier by penalizing false classifications. Minimizing the Log Loss is basically equivalent to maximizing the accuracy of the classifier. A perfect classifier would have a Log Loss of precisely zero. Less ideal classifiers have progressively larger values of Log Loss.

Log Loss heavily penalizes classifiers that are confident about an incorrect classification. For example, if for a particular observation, the classifier assigns a very small probability to the correct class then the corresponding contribution to the Log Loss will be very large indeed. Naturally this is going to have a significant impact on the overall Log Loss for the classifier. The bottom line is that it’s better to be somewhat wrong than emphatically wrong. Of course it’s always better to be completely right, but that is seldom achievable in practice! There are at least two approaches to dealing with poor classifications:

  1. Examine the problematic observations relative to the full data set. Are they simply outliers? In this case, remove them from the data and re-train the classifier.
  2. Consider smoothing the predicted probabilities using, for example, Laplace Smoothing. This will result in a less “certain” classifier and might improve the overall Log Loss.
## [1] 0.3
## train valid  xval 
##  0.27  0.30  0.32
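For reference, the standard definition of binary log loss (consistent with the 0/1 framing above) is:

\[LogLoss = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i\log(p_i) + (1-y_i)\log(1-p_i)\right]\]

where \(y_i\) is the actual class (0/1) and \(p_i\) is the predicted probability of class 1. A confident wrong prediction (\(p_i\) near 0 when \(y_i = 1\)) makes \(\log(p_i)\) a large negative number, which is exactly the heavy penalty described above.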

9.1.3 Confusion Matrix

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.361470825344551:
##         No Yes    Error     Rate
## No     171  13 0.070652  =13/184
## Yes     13  23 0.361111   =13/36
## Totals 184  36 0.118182  =26/220
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.348527593514236:
##         No Yes    Error       Rate
## No     838  51 0.057368    =51/889
## Yes     58 118 0.329545    =58/176
## Totals 896 169 0.102347  =109/1065
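As a sketch, both matrices can be produced with h2o.confusionMatrix(); the object names are assumptions here (`perf_h2o` built on the 220-row test set, `model_h2o` the model itself, which reports on the full 1065-row data):

```r
# Confusion matrix at the max-F1 threshold for the test set:
h2o.confusionMatrix(perf_h2o)

# Called on the model, the matrix reflects the data the model carries:
h2o.confusionMatrix(model_h2o)
```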

Must understand threshold, precision and recall:

  • Top Row - Predictions; Left Column - Actuals
  • Threshold is the probability cutoff that determines whether a class probability becomes a 1 or a 0
    • Example: a record is assigned a 1 if its predicted probability is above the threshold value
  • F1 is the optimal balance between precision and recall. The threshold shown maximizes F1. However, this is not always the best choice: an expected value optimization is required when the costs of false positives and false negatives are known.
    • F1 = (2 * (precision * recall)) / (precision + recall) - hence it balances precision & recall
  • PRECISION - Measures false positives (e.g., predicted to leave but actually stay). Precision = TP/(TP + FP): it detects how frequently the algorithm over-picks the Yes class
  • RECALL - Measures false negatives (e.g., predicted to stay but actually leave)
    • Recall = TP/(TP + FN) - it provides a metric for under-picking the Yes class
  • There are different costs associated with FN and FP; FN often costs a company more.
    This is where Expected Value (EV) comes into play - that discussion is coming.

9.2 H2O Metrics: Analyze Confusion Matrix

Review the output from the confusion matrix.

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.348527593514236:
##         No Yes    Error       Rate
## No     838  51 0.057368    =51/889
## Yes     58 118 0.329545    =58/176
## Totals 896 169 0.102347  =109/1065

How does the max F1 threshold affect the models? Use h2o.metric(). Make the output easier to read by turning it into a tibble.

## # A tibble: 220 x 20
##    threshold     f1     f2 f0point5 accuracy precision recall specificity
##        <dbl>  <dbl>  <dbl>    <dbl>    <dbl>     <dbl>  <dbl>       <dbl>
##  1     0.908 0.0541 0.0345    0.125    0.841         1 0.0278           1
##  2     0.816 0.105  0.0685    0.227    0.845         1 0.0556           1
##  3     0.802 0.154  0.102     0.312    0.85          1 0.0833           1
##  4     0.770 0.20   0.135     0.385    0.855         1 0.111            1
##  5     0.721 0.244  0.168     0.446    0.859         1 0.139            1
##  6     0.662 0.286  0.200     0.5      0.864         1 0.167            1
##  7     0.662 0.326  0.232     0.547    0.868         1 0.194            1
##  8     0.655 0.364  0.263     0.588    0.873         1 0.222            1
##  9     0.635 0.4    0.294     0.625    0.877         1 0.25             1
## 10     0.630 0.435  0.325     0.658    0.882         1 0.278            1
## # ... with 210 more rows, and 12 more variables: absolute_mcc <dbl>,
## #   min_per_class_accuracy <dbl>, mean_per_class_accuracy <dbl>,
## #   tns <dbl>, fns <dbl>, fps <dbl>, tps <dbl>, tnr <dbl>, fnr <dbl>,
## #   fpr <dbl>, tpr <dbl>, idx <int>
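The tibble above can be produced with a pattern like this (assuming `perf_h2o` is the performance object from h2o.performance()):

```r
# One row per candidate threshold, 20 metric columns:
perf_h2o %>%
     h2o.metric() %>%
     as.tibble()
```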

9.3 Precision, Recall, F1 & Effect Of Threshold

Precision

\[Precision = TP/(TP + FP)\]

In other words, Precision detects how frequently we over-pick the Yes class. In the confusion matrix above, 51 records were over-picked - these employees were predicted to leave but did not. (No real harm in this - the company did not lose money.)

## [1] 0.7

This is the percent of the time the algorithm's prediction that someone would leave was correct.

Recall

\[Recall = TP/(TP + FN)\] Recall provides a metric for under-picking the Yes class.

## [1] 0.67

This measures how often we will miss people that leave by predicting they will stay.

Recall is typically more important than Precision because you would usually rather accept some false positives (inadvertently target stayers) than false negatives (fail to predict quitters).

F1

F1 balances precision and recall.

\[F1 = (2 * (Precision * Recall))/(Precision + Recall)\]

## [1] 0.68

The max F1 threshold is the probability cutoff that maximizes F1 across all candidate thresholds.
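The three values can be checked by hand from the 1065-row confusion matrix shown earlier (TP = 118, FP = 51, FN = 58):

```r
tp <- 118; fp <- 51; fn <- 58
precision <- tp / (tp + fp)                                  # ~0.70
recall    <- tp / (tp + fn)                                  # ~0.67
f1        <- 2 * (precision * recall) / (precision + recall) # ~0.68
```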

Threshold {#precisionrecallplot}

Note the use of a new H2O function called h2o.find_threshold_by_max_metric.
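A minimal sketch, assuming `perf_h2o` is the performance object:

```r
# Returns the threshold at which F1 is maximized:
h2o.find_threshold_by_max_metric(perf_h2o, "f1")
```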

9.4 Performance Charts For Data Scientists

To compare models, we need to load a few more models. A deep learning model has already been saved. Let's save one of the stacked models.

##                                                model_id  auc logloss
## 1             GLM_grid_0_AutoML_20181028_212737_model_0 0.85    0.32
## 2 StackedEnsemble_BestOfFamily_0_AutoML_20181028_212737 0.84    0.32
## 3    StackedEnsemble_AllModels_0_AutoML_20181028_212737 0.84    0.32
## 4             GBM_grid_0_AutoML_20181028_212737_model_4 0.82    0.34
## 5             GBM_grid_0_AutoML_20181028_212737_model_5 0.80    0.36
## 6             GBM_grid_0_AutoML_20181028_212737_model_1 0.80    0.36
##   mean_per_class_error rmse   mse
## 1                 0.24 0.31 0.094
## 2                 0.23 0.31 0.093
## 3                 0.24 0.30 0.093
## 4                 0.25 0.32 0.103
## 5                 0.28 0.33 0.109
## 6                 0.27 0.33 0.107
## 
## [21 rows x 6 columns]

Next, collect the performance metrics for the deep learning and stacked models. Develop two new functions: load_model_performance_metrics and Model_Metrics.

9.5 Performance Charts For Business People

Gain and lift charts are used to evaluate the performance of a classification model. They measure how much better one can expect to do with the predictive model compared to without a model. They are very popular metrics in marketing analytics, but are not restricted to marketing; they can be used in other domains as well, such as risk modeling and supply chain analytics. They also help to find the best predictive model among multiple challenger models.

See https://www.listendata.com/2014/08/excel-template-gain-and-lift-charts.html - it seems to follow the same process as the one below. The goal is to make a gain-lift table:

  1. Below is a score file in descending order by estimated probability. By ranking by the class probability of Yes, we assess the model's ability to detect someone that is leaving.
  • Note the use of bind_cols, which is part of dplyr
## # A tibble: 220 x 4
##    predict     No   Yes Attrition
##    <fct>    <dbl> <dbl> <fct>    
##  1 Yes     0.0921 0.908 Yes      
##  2 Yes     0.184  0.816 Yes      
##  3 Yes     0.198  0.802 Yes      
##  4 Yes     0.230  0.770 Yes      
##  5 Yes     0.279  0.721 Yes      
##  6 Yes     0.338  0.662 Yes      
##  7 Yes     0.338  0.662 Yes      
##  8 Yes     0.345  0.655 Yes      
##  9 Yes     0.365  0.635 Yes      
## 10 Yes     0.370  0.630 Yes      
## # ... with 210 more rows
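The ranked score file above can be built with a pattern like this (assumed object names: `predictions_h2o` from h2o.predict(), `test_tbl` holding the original test data):

```r
ranked_tbl <- predictions_h2o %>%
     as.tibble() %>%
     bind_cols(test_tbl %>% select(Attrition)) %>%
     arrange(desc(Yes))   # rank by probability of leaving
```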

9.5.1 Gain & Lift Tables

Steps:

  1. Split the ranked file into 10 sections (deciles). Grouping into cohorts from most likely to least likely is at the heart of gain/lift. When we do this, we can immediately show how likely the candidates with a high probability of leaving are to actually leave.

  2. Number of observations in each decile: cases = n()
  3. Number of actual events in each decile: responses = sum(Attrition == "Yes")

## # A tibble: 10 x 3
##    ntile cases responses
##    <int> <int>     <int>
##  1    10    22        17
##  2     9    22         8
##  3     8    22         2
##  4     7    22         3
##  5     6    22         2
##  6     5    22         1
##  7     4    22         3
##  8     3    22         0
##  9     2    22         0
## 10     1    22         0

In ntile = 10, 17 people actually left out of 22 employees! 8 of 22 actually left in the next ntile.

  • Note the use of ntile, also from dplyr. We simply group records into 10 groups. Note the 10th group has the highest values for Yes. This makes the continuous variable Yes into a discrete variable using the new groups called ntile.
  • dplyr::row_number() is useful for calculations - it simply makes the row numbers of a tibble into a column
## # A tibble: 10 x 4
##    ntile cases responses group
##    <int> <int>     <int> <int>
##  1    10    22        17     1
##  2     9    22         8     2
##  3     8    22         2     3
##  4     7    22         3     4
##  5     6    22         2     5
##  6     5    22         1     6
##  7     4    22         3     7
##  8     3    22         0     8
##  9     2    22         0     9
## 10     1    22         0    10

The group column - the row numbers - now helps. The first group has the highest number of employees that actually left. The ntile column is no longer needed.

Time to calculate the gain and lift.

  1. Cumulative responses: the number of cumulative actual events in each decile: cumulative_responses = cumsum(responses)
  2. Percentage responses: pct_responses = responses / sum(responses)
  3. Cumulative gain: cumulatively sum the pct_responses: gain = cumsum(pct_responses)
    • Think of this as what we gained by using the model. If we focused on the first 2 groups, we gain the ability to target ~70% of our quitters using the model. Remember, this is measuring employees that actually left, and the model is identifying 70% of them!
    • gain is really the cumulative_capture_rate
  4. A calculation is needed before we can measure lift: cumulative_pct_cases = cumsum(cases)/sum(cases). This is a simple calculation. Since we have 10 groups, it will start at 0.100 and end at 1.00 in steps of 0.100: .1, .2, .3, ...
  5. Calculate lift. Think of lift as a multiplier between what we gained and what we expected to gain with no model. For example, if we focused on the first 2 decile groups, we gained the ability to target ~70% of our quitters but we expected to be able to target only 20% (2 deciles (cumulative_pct_cases) = 20%)
    • lift is just another way to show the gain
    • Below, lift = 4.7 simply means you are 4.7X more likely to identify quitters using the model
## Observations: 10
## Variables: 10
## $ group                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
## $ cases                <int> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22
## $ responses            <int> 17, 8, 2, 3, 2, 1, 3, 0, 0, 0
## $ cumulative_responses <int> 17, 25, 27, 30, 32, 33, 36, 36, 36, 36
## $ pct_responses        <dbl> 0.472, 0.222, 0.056, 0.083, 0.056, 0.028,...
## $ gain                 <dbl> 0.47, 0.69, 0.75, 0.83, 0.89, 0.92, 1.00,...
## $ cumulative_pct_cases <dbl> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0...
## $ lift                 <dbl> 4.7, 3.5, 2.5, 2.1, 1.8, 1.5, 1.4, 1.2, 1...
## $ gain_baseline        <dbl> 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0...
## $ lift_baseline        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

Above, it is uncertain why gain_baseline and lift_baseline were added. (BTW, lift_baseline is always = 1.)
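The manual steps can be sketched as one pipeline (assuming `ranked_tbl` is the scored file from earlier; the baseline definitions are an assumption based on the glimpse output above):

```r
gain_lift_tbl <- ranked_tbl %>%
     mutate(ntile = ntile(Yes, n = 10)) %>%
     group_by(ntile) %>%
     summarise(cases = n(), responses = sum(Attrition == "Yes")) %>%
     arrange(desc(ntile)) %>%                 # most likely group first
     mutate(
          group                = row_number(),
          cumulative_responses = cumsum(responses),
          pct_responses        = responses / sum(responses),
          gain                 = cumsum(pct_responses),
          cumulative_pct_cases = cumsum(cases) / sum(cases),
          lift                 = gain / cumulative_pct_cases,
          gain_baseline        = cumulative_pct_cases,  # assumed no-model reference
          lift_baseline        = 1                      # assumed no-model reference
     )
```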

9.5.2 h2o.gainsLift()

h2o.gainsLift() streamlines the manual gain-lift code above. h2o.performance() is the starting point for the H2O gain and lift function.

## Observations: 16
## Variables: 13
## $ group                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12...
## $ cumulative_data_fraction <dbl> 0.014, 0.023, 0.032, 0.041, 0.050, 0....
## $ lower_threshold          <dbl> 0.79583, 0.69841, 0.65794, 0.63159, 0...
## $ lift                     <dbl> 6.11, 6.11, 6.11, 6.11, 3.06, 3.89, 2...
## $ cumulative_lift          <dbl> 6.1, 6.1, 6.1, 6.1, 5.6, 4.7, 3.9, 3....
## $ response_rate            <dbl> 1.000, 1.000, 1.000, 1.000, 0.500, 0....
## $ score                    <dbl> 0.8419, 0.7452, 0.6622, 0.6449, 0.625...
## $ cumulative_response_rate <dbl> 1.00, 1.00, 1.00, 1.00, 0.91, 0.77, 0...
## $ cumulative_score         <dbl> 0.84, 0.80, 0.76, 0.74, 0.72, 0.64, 0...
## $ capture_rate             <dbl> 0.083, 0.056, 0.056, 0.056, 0.028, 0....
## $ cumulative_capture_rate  <dbl> 0.083, 0.139, 0.194, 0.250, 0.278, 0....
## $ gain                     <dbl> 511, 511, 511, 511, 206, 289, 122, 12...
## $ cumulative_gain          <dbl> 511, 511, 511, 511, 456, 372, 289, 24...
  • group = ntiles by 16: 220 employees / 16 ntiles = 13.75 employees/group (the manual calculation set ntile = 10)
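A sketch of the H2O version (assuming `perf_h2o` is the performance object):

```r
gain_lift_tbl <- perf_h2o %>%
     h2o.gainsLift() %>%
     as.tibble()
```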

Only a subset of the H2O data is needed:

## # A tibble: 16 x 4
##    group cumulative_data_fraction cumulative_capture_rate cumulative_lift
##    <int>                    <dbl>                   <dbl>           <dbl>
##  1     1                   0.0136                  0.0833            6.11
##  2     2                   0.0227                  0.139             6.11
##  3     3                   0.0318                  0.194             6.11
##  4     4                   0.0409                  0.25              6.11
##  5     5                   0.05                    0.278             5.56
##  6     6                   0.1                     0.472             4.72
##  7     7                   0.15                    0.583             3.89
##  8     8                   0.2                     0.694             3.47
##  9     9                   0.3                     0.75              2.5 
## 10    10                   0.4                     0.833             2.08
## 11    11                   0.5                     0.889             1.78
## 12    12                   0.6                     0.917             1.53
## 13    13                   0.7                     1                 1.43
## 14    14                   0.8                     1                 1.25
## 15    15                   0.9                     1                 1.11
## 16    16                   1                       1                 1

This is all that is needed to create our gain and lift plot.

9.5.3 Gain Plot

  • Use gather() to stack the data to use the ggplot color and group aesthetics or facet_wrap
  • For the gain data, cumulative_lift is not needed (but it is later for lift)
## # A tibble: 32 x 4
##    group cumulative_data_fraction key    value
##    <int>                    <dbl> <chr>  <dbl>
##  1     1                   0.0136 gain  0.0833
##  2     2                   0.0227 gain  0.139 
##  3     3                   0.0318 gain  0.194 
##  4     4                   0.0409 gain  0.25  
##  5     5                   0.05   gain  0.278 
##  6     6                   0.1    gain  0.472 
##  7     7                   0.15   gain  0.583 
##  8     8                   0.2    gain  0.694 
##  9     9                   0.3    gain  0.75  
## 10    10                   0.4    gain  0.833 
## # ... with 22 more rows
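The stacked data above can be produced with gather(); a sketch, assuming `gain_lift_tbl` holds the four-column subset from the previous section and that the second key is a no-model baseline (an assumption here, which would explain the 32 rows = 16 groups x 2 keys):

```r
gain_transformed_tbl <- gain_lift_tbl %>%
     select(-cumulative_lift) %>%
     rename(gain = cumulative_capture_rate) %>%
     mutate(baseline = cumulative_data_fraction) %>%  # assumed no-model baseline
     gather(key = "key", value = "value", gain, baseline)
```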

9.5.4 Lift Chart

  • cumulative_capture_rate is not needed here
  • The lift baseline is always = 1
## # A tibble: 32 x 4
##    group cumulative_data_fraction key   value
##    <int>                    <dbl> <chr> <dbl>
##  1     1                   0.0136 lift   6.11
##  2     2                   0.0227 lift   6.11
##  3     3                   0.0318 lift   6.11
##  4     4                   0.0409 lift   6.11
##  5     5                   0.05   lift   5.56
##  6     6                   0.1    lift   4.72
##  7     7                   0.15   lift   3.89
##  8     8                   0.2    lift   3.47
##  9     9                   0.3    lift   2.5 
## 10    10                   0.4    lift   2.08
## # ... with 22 more rows

9.5.5 Explain Gain/Lift Plots

NOTE: The plots used for explanation differ slightly from the calculated plots in the RMD.

Below, doing nothing - not using the model - targeting 25% of employees would capture only 25% of those who leave. With the model, the same 25% captures 75% of leavers - 3X more!

Strategically targeting 25% of high probability customers should yield 75% of potential positive responses!

Lift is a multiplier: how many positive responses you would get at random X lift = how many you would get with the model. 3X is shown below.

Example: How does lift apply to attrition? Provide stock options strategically to high performers that are at high risk. We could focus on those in the high-risk category that are working overtime. Stock options would then cost less overall compared to offering them randomly - reducing cost by 33%!

Always explain in terms of more customers, less churn, better quality, reduced lead times, or anything else executives care about.

9.6 Model Performance Comparison Dashboard

plot_h2o_performance brings together the model and business plots into one function: ROC, Precision vs Recall, Gain and Lift plots.

  • order_by <- tolower(order_by[[1]]) simply selects the first item in the argument order_by = c("auc", "logloss"). If the user does not select one, auc will be used as the default.
  • rlang::sym(order_by) converts a string stored within a variable to a column name (symbol) that is unevaluated so we can use it later using !! (bang bang)
  • h2o.getModel gets a model from the current session. loadModel retrieves a model that was previously saved to disk.
  • mutate(metrics = map(model_id, get_model_performance_metrics, newdata_tbl)) - mutate and map are a powerful combination enabling row-wise iteration in a tidy way. Use this pattern! {#firstMap}
  • fct_reorder(!! order_by_expr, .desc = ifelse(order_by =="auc", TRUE, FALSE)) is cool. If the selected metric is auc, the code will order highest to lowest. If the metric is logloss, it sorts the data lowest to highest. Very cool!
  • fct_reorder requires a numeric value
  • Combine plots using cowplot
  • glue is used - a better method of pasting strings than paste0.
    • Uses {} to mix code with text. Uses only one string rather than many strings - just easier than paste0
plot_h2o_performance <- function(h2o_leaderboard, newdata, order_by = c("auc", "logloss"), max_models = 3, size = 1.5){
# Inputs
     leaderboard_tbl <- h2o_leaderboard %>%  as.tibble() %>% slice(1:max_models)
     newdata_tbl <- newdata %>% as.tibble()
     order_by <- tolower(order_by[[1]])
     order_by_expr <- rlang::sym(order_by)
     h2o.no_progress()
     
# 1. Model Metrics
     get_model_performance_metrics <- function(model_id, test_tbl){
          model_h2o <- h2o.getModel(model_id)
          perf_h2o <- h2o.performance(model_h2o, newdata = as.h2o(test_tbl))
          perf_h2o %>% h2o.metric() %>% as.tibble() %>% select(threshold, tpr, fpr,
                                                               precision, recall)}
     
     model_metrics_tbl <- leaderboard_tbl %>% 
          mutate(metrics = map(model_id, get_model_performance_metrics, newdata_tbl)) %>% 
          unnest() %>% 
          mutate(
               model_id = as_factor(model_id) %>% 
                    fct_reorder(!! order_by_expr, .desc = ifelse(order_by =="auc", TRUE, FALSE)),
               
               auc = auc %>% round(3) %>% as.character() %>% as_factor() %>% 
                    fct_reorder(as.numeric(model_id)),
               
               logloss = logloss %>% round(4) %>% as.character() %>% as_factor() %>% 
                    fct_reorder(as.numeric(model_id)))
     
     # 1A. ROC Plot
     p1 <- model_metrics_tbl %>% ggplot(aes_string(x = "fpr", 
          y = "tpr", color = "model_id", linetype = order_by)) + geom_line(size = size) +
          theme_tq() + scale_color_tq() + labs(title = "ROC", x = "FPR", y = "TPR") + 
          theme(legend.direction = "vertical") +
          annotate(geom = "text", x = 0.4, y = 0.25, size = 3, color = "DarkSlateGrey",
                label = "TPR - Rate people correctly identified as leaving") +
          annotate(geom = "text", x = 0.4, y = 0.20, size = 3, color = "DarkSlateGrey",
                label = "FPR - Rate people that stay incorrectly identified as leaving")
     
     # 1B. Precision vs Recall Plot
     p2 <- model_metrics_tbl %>% 
          ggplot(aes_string(x = "recall", y = "precision", color = "model_id", linetype = order_by)) + 
          geom_line(size = size) +
          theme_tq() + scale_color_tq() + labs(title = "Precision vs Recall", 
               x =  "Recall", y = "Precision") +
          theme(legend.position = "none") + 
          annotate(geom = "text", x = 0.4, y = 0.25, size = 3, color = "DarkSlateGrey",
               label = "Precision - Predicting employees to leave but actually stay") +
          annotate(geom = "text", x = 0.4, y = 0.20, size = 3, color = "DarkSlateGrey",
               label = "Recall - Predicting employee to stay but actually leaves")

# 2. Gain / Lift
     get_gain_lift <- function(model_id, test_tbl){
          model_h2o <- h2o.getModel(model_id)
          perf_h2o <- h2o.performance(model_h2o, newdata = as.h2o(test_tbl))
          perf_h2o %>% h2o.gainsLift() %>% as.tibble() %>% 
               select(group, cumulative_data_fraction, cumulative_capture_rate, cumulative_lift)} 
     
     gain_lift_tbl <- leaderboard_tbl %>% 
          mutate(metrics = map(model_id, get_gain_lift, newdata_tbl)) %>% unnest() %>% 
          mutate(
               model_id = as_factor(model_id) %>% 
                    fct_reorder(!! order_by_expr, .desc = ifelse(order_by == "auc", TRUE, FALSE)),
               
               auc = auc %>% round(3) %>%  as.character() %>% as_factor() %>%
                    fct_reorder(as.numeric(model_id)),
               
               logloss = logloss %>% round(4) %>% as.character() %>% as_factor() %>% 
                    fct_reorder(as.numeric(model_id))) %>% 
          
          rename(gain = cumulative_capture_rate, lift = cumulative_lift)
     # 2A. Gain Plot
     
     p3 <- gain_lift_tbl %>% ggplot(aes_string(x = "cumulative_data_fraction", y = "gain", 
               color= "model_id", linetype = order_by)) + geom_line(size = size) +
          geom_segment(x = 0, y = 0, xend = 1, yend = 1, color = "black", size = size) + theme_tq() + scale_color_tq() +
          expand_limits(x = c(0, 1), y = c(0, 1)) + labs(title = "Gain", x = "Cumulative Data Fraction", y = "Gain") +
          theme(legend.position = "none")
     
     # 2B. Lift Plot
     p4 <- gain_lift_tbl %>% ggplot(aes_string(x = "cumulative_data_fraction", y = "lift", 
               color= "model_id", linetype = order_by)) + geom_line(size = size) +
          geom_segment(x = 0, y = 1, xend = 1, yend = 1, color = "black", size = size) + theme_tq() + scale_color_tq() +
          expand_limits(x = c(0, 1), y = c(0, 1)) + labs(title = "Lift", x = "Cumulative Data Fraction", y = "Lift") +
          theme(legend.position = "none")
# Cow Plot     
     p_legend <- cowplot::get_legend(p1)
     p1 <- p1 + theme(legend.position = "none")
     
     p <- cowplot::plot_grid(p1, p2, p3, p4, ncol=2)
     
     p_title <- ggdraw() + draw_label("H2O Model Metrics", size = 18, fontface = "bold",
                     colour = palette_light()[[1]])#MUST use COLOUR
     p_subtitle <- ggdraw() + draw_label(glue("Ordered by {toupper(order_by)}"), size = 10, 
                    colour = palette_light()[[1]])
     ret <- plot_grid(p_title, p_subtitle, p, p_legend, ncol = 1, rel_heights = c(0.05, 0.05, 1, 0.05 * max_models))
     h2o.show_progress()
     return(ret)}

Test the new function.

10 Explaining Models With LIME

Take the developed H2O models and show how to explain localized prediction results using a special technique called LIME (Local Interpretable Model-Agnostic Explanations).

  • How to make an explainer using the lime() function for single and multiple employees
  • How to make an explanation using the explain() function for single and multiple employees
  • How to visualize model explanations using plot_features() (one or a small number of observations) and plot_explanations() (for many observations)

This section answers the question WHY something happens - at the record level.

10.1 Feature Explanation With LIME

10.1.1 Investigating Predictions

This is a preparatory step before getting into the weeds with lime. This should be familiar and serves as a review.

## # A tibble: 6 x 3
##   predict    No    Yes
##   <fct>   <dbl>  <dbl>
## 1 No      0.964 0.0359
## 2 No      0.891 0.109 
## 3 No      0.967 0.0330
## 4 No      0.953 0.0471
## 5 Yes     0.210 0.790 
## 6 No      0.937 0.0631

Focus on record 5 since it is predicted to turn over with a high probability. Get more information about this person so we know what actually happened, and also get the employee's ID so we know who we are talking about.

## # A tibble: 6 x 5
##   predict    No    Yes Attrition EmployeeNumber
##   <fct>   <dbl>  <dbl> <fct>              <dbl>
## 1 No      0.964 0.0359 No                   228
## 2 No      0.891 0.109  No                  1278
## 3 No      0.967 0.0330 No                  1250
## 4 No      0.953 0.0471 No                  2065
## 5 Yes     0.210 0.790  Yes                 1767
## 6 No      0.937 0.0631 No                  1308

We now know the employee actually did turnover.

Now find the details about employee 1767, who did turn over as predicted.

## Observations: 1
## Variables: 32
## $ Age                      <dbl> 43
## $ Attrition                <fct> Yes
## $ BusinessTravel           <fct> Travel_Frequently
## $ DailyRate                <dbl> 807
## $ Department               <fct> Research & Development
## $ DistanceFromHome         <dbl> 17
## $ Education                <fct> Bachelor
## $ EducationField           <fct> Technical Degree
## $ EmployeeNumber           <dbl> 1767
## $ EnvironmentSatisfaction  <fct> High
## $ Gender                   <fct> Male
## $ HourlyRate               <dbl> 38
## $ JobInvolvement           <fct> Medium
## $ JobLevel                 <fct> 1
## $ JobRole                  <fct> Research Scientist
## $ JobSatisfaction          <fct> High
## $ MaritalStatus            <fct> Married
## $ MonthlyIncome            <dbl> 2437
## $ MonthlyRate              <dbl> 15587
## $ NumCompaniesWorked       <dbl> 9
## $ OverTime                 <fct> Yes
## $ PercentSalaryHike        <dbl> 16
## $ PerformanceRating        <fct> Excellent
## $ RelationshipSatisfaction <fct> Very High
## $ StockOptionLevel         <fct> 1
## $ TotalWorkingYears        <dbl> 6
## $ TrainingTimesLastYear    <dbl> 4
## $ WorkLifeBalance          <fct> Better
## $ YearsAtCompany           <dbl> 1
## $ YearsInCurrentRole       <dbl> 0
## $ YearsSinceLastPromotion  <dbl> 0
## $ YearsWithCurrManager     <dbl> 0

It appears some factors support attrition and some do not. To know why this employee turned over, use lime.

10.1.2 Lime For Single Explanation

LIME is used to determine which features contribute to the prediction (and by how much) for a single (local) observation.

  • 2 steps:
    1. Build the explainer with lime()
    2. Create the explanation with explain()

h2o, keras and caret have been integrated into lime(). For other packages, use the special functions model_type() and predict_model(). http://uc-r.github.io/lime

  1. To build an explainer, use the ML model and feature distributions (bins) for the training data.
  • drop Attrition for the H2O model
  • bin_continuous makes it easy to detect what causes a continuous feature to have a high feature weight in the lime explanation
  • quantile_bins tells how to distribute observations within the bins. If TRUE, the cuts are chosen so that each bin contains roughly the same number of observations
  • n_bins: 4-5 bins is typically sufficient. If n_bins = 4, bin_cuts will have 5 cuts. The bins are then the observations that fall in between the cuts.

The output is voluminous and omitted to save pages of printing! It includes how continuous variables are cut. This is important information.

explainer provides much data - be careful before printing.
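As a sketch (assuming `train_tbl` holds the prepared training data and `automl_leader` is the H2O leader model), the explainer call looks roughly like:

```r
library(lime)
library(dplyr)

# Build the explainer from the training-data feature distributions.
# Attrition is dropped because the H2O model predicts without the target.
explainer <- train_tbl %>%
    select(-Attrition) %>%
    lime(
        model          = automl_leader,
        bin_continuous = TRUE,    # bin continuous features
        n_bins         = 4,       # 4 bins -> 5 bin cuts
        quantile_bins  = TRUE     # quantile-based cuts
    )
```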

  2. Create the explanation with explain()

LIME 6 steps for explanation:

  1. Given an observation, permute it to create replicated feature data with slight value modifications.
  2. Compute similarity distance measure between original observation and permuted observations.
  3. Apply selected machine learning model to predict outcomes of permuted data.
  4. Select the m features that best describe the predicted outcomes.
  5. Fit a simple model to the permuted data, explaining the complex model outcome with the m features, weighting each permuted observation by its similarity to the original observation.
  6. Use the resulting feature weights to explain local behavior.
  • n_labels: binary classification therefore just one label
  • n_features: number of important features to return to explain the record prediction
  • n_permutations: the more tests/evaluations, the better the explanations (typically). Lime will create 5,000 permuted observations per case to fit the localized linear model.
  • kernel_width: Affects the LIME linear model (R^2) and therefore should be tuned to ensure you get the best explanations - IMPORTANT
  • Attrition is removed because the data argument must match the format the model requires to predict. Since H2O requires x to be without the target variable, it must be removed below.
  • Use lime::explain because explain is a common function used in other packages.
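A sketch of the explain() call for the single case; the argument values mirror the bullets above, and `explainer` is the object built with lime():

```r
explanation <- test_tbl %>%
    slice(5) %>%                 # the employee predicted to leave
    select(-Attrition) %>%       # data must match what the model expects
    lime::explain(
        explainer      = explainer,
        n_labels       = 1,      # binary classification: one label
        n_features     = 8,      # return the 8 most important features
        n_permutations = 5000,   # permuted observations per case
        kernel_width   = 0.5     # tune to maximize model_r2
    )
```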
##       model_type case label label_prob model_r2 model_intercept
## 1 classification    1   Yes       0.79    0.305          0.0699
## 2 classification    1   Yes       0.79    0.305          0.0699
## 3 classification    1   Yes       0.79    0.305          0.0699
## 4 classification    1   Yes       0.79    0.305          0.0699
## 5 classification    1   Yes       0.79    0.305          0.0699
## 6 classification    1   Yes       0.79    0.305          0.0699
## 7 classification    1   Yes       0.79    0.305          0.0699
## 8 classification    1   Yes       0.79    0.305          0.0699
##   model_prediction            feature feature_value feature_weight
## 1            0.492           OverTime             2         0.1783
## 2            0.492 NumCompaniesWorked             9         0.0887
## 3            0.492           JobLevel             1         0.0896
## 4            0.492      MonthlyIncome          2437         0.0799
## 5            0.492   StockOptionLevel             2        -0.0769
## 6            0.492   DistanceFromHome            17         0.0640
## 7            0.492     BusinessTravel             3         0.0569
## 8            0.492     YearsAtCompany             1        -0.0583
##                         feature_desc
## 1                     OverTime = Yes
## 2             4 < NumCompaniesWorked
## 3                       JobLevel = 1
## 4              MonthlyIncome <= 2935
## 5               StockOptionLevel = 1
## 6            13.8 < DistanceFromHome
## 7 BusinessTravel = Travel_Frequently
## 8                YearsAtCompany <= 3
##                                                                                                          data
## 1 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 2 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 3 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 4 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 5 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 6 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 7 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
## 8 43, 3, 807, 2, 17, 3, 6, 1767, 3, 2, 38, 2, 1, 7, 3, 2, 2437, 15587, 9, 2, 16, 3, 4, 2, 6, 4, 3, 1, 0, 0, 0
##   prediction
## 1 0.21, 0.79
## 2 0.21, 0.79
## 3 0.21, 0.79
## 4 0.21, 0.79
## 5 0.21, 0.79
## 6 0.21, 0.79
## 7 0.21, 0.79
## 8 0.21, 0.79
## [1] 0.305

Want R^2 to be as high as possible - optimize by adjusting kernel_width

In this case, kernel_width = 0.5 works relatively well. Trying values from 0.2 to 1.5, 0.2 performed best.

## [1] 0.315
## [1] 0.312

Explore other tuning parameters

Note we requested 8 features in explain(), so there are 8 records below.

## # A tibble: 8 x 6
##   feature   feature_value feature_weight feature_desc     data  prediction
##   <chr>             <dbl>          <dbl> <chr>            <lis> <list>    
## 1 OverTime              2         0.184  OverTime = Yes   <lis~ <list [2]>
## 2 JobLevel              1         0.0964 JobLevel = 1     <lis~ <list [2]>
## 3 NumCompa~             9         0.0814 4 < NumCompanie~ <lis~ <list [2]>
## 4 StockOpt~             2        -0.0758 StockOptionLeve~ <lis~ <list [2]>
## 5 Business~             3         0.0773 BusinessTravel ~ <lis~ <list [2]>
## 6 Distance~            17         0.0635 13.8 < Distance~ <lis~ <list [2]>
## 7 MonthlyI~          2437         0.0601 MonthlyIncome <~ <lis~ <list [2]>
## 8 YearsAtC~             1        -0.0557 YearsAtCompany ~ <lis~ <list [2]>
  • feature_value is the actual value for the record. For example, this employee has worked at 9 companies and lives 17 miles from work.
  • feature_weight indicates the importance; the sign indicates whether the feature supports (+) or contradicts (-) the predicted label.
  • Positive weights support the positive binary classification (Yes, i.e. attrition, in this case)

kernel_width - R^2 values: 0.3: 0.303; 0.4: 0.294; 1.0: 0.309; 1.5: 0.294

In this case (values may vary in RMD), kernel_width = 1 produced the highest R^2 value.
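One way to tune kernel_width, sketched with purrr (assuming the `explainer` and `test_tbl` objects as above): refit the explanation at several widths and compare the local model R^2.

```r
library(purrr)
library(dplyr)

# Map over candidate kernel widths; larger model_r2 = better local fit
tibble(kernel_width = c(0.2, 0.5, 1.0, 1.5)) %>%
    mutate(model_r2 = map_dbl(kernel_width, function(kw) {
        test_tbl %>%
            slice(5) %>%
            select(-Attrition) %>%
            lime::explain(explainer, n_labels = 1, n_features = 8,
                          n_permutations = 5000, kernel_width = kw) %>%
            pull(model_r2) %>%
            first()          # model_r2 repeats per feature row
    }))
```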

10.1.3 Visualizing A Single Explanation: plot_features()

  • ncol = 1 since we only have one record to examine - at this time.
  • note the red bars correspond to the negative feature_weights in the table above.
  • Note 4 < NumCompaniesWorked is a label from continuous binning performed earlier
  • Red (contradicts) features represent those that encourage an employee to stay (supporting the No classification)
  • The GREEN (supports) bars outweigh the RED (contradicts) bars
  • Explanation Fit is the R^2 value
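The single-case plot is one call (`explanation` being the object returned by lime::explain()):

```r
# One case, so a single column of feature bars
plot_features(explanation, ncol = 1)
```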

10.1.4 Visualizing Multiple Explanations: plot_explanations()

Select 20 observations by modifying the earlier slice() call.

## # A tibble: 160 x 13
##    model_type case  label label_prob model_r2 model_intercept
##    <chr>      <chr> <chr>      <dbl>    <dbl>           <dbl>
##  1 classific~ 1     No         0.964    0.272           0.655
##  2 classific~ 1     No         0.964    0.272           0.655
##  3 classific~ 1     No         0.964    0.272           0.655
##  4 classific~ 1     No         0.964    0.272           0.655
##  5 classific~ 1     No         0.964    0.272           0.655
##  6 classific~ 1     No         0.964    0.272           0.655
##  7 classific~ 1     No         0.964    0.272           0.655
##  8 classific~ 1     No         0.964    0.272           0.655
##  9 classific~ 2     No         0.891    0.335           0.804
## 10 classific~ 2     No         0.891    0.335           0.804
## # ... with 150 more rows, and 7 more variables: model_prediction <dbl>,
## #   feature <chr>, feature_value <dbl>, feature_weight <dbl>,
## #   feature_desc <chr>, data <list>, prediction <list>
## [1] 0.239 0.335

Too much information - the plot is unreadable with this many cases. Use another method: plot_explanations().
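A sketch of the multiple-case workflow (same `explainer`; the 20-case slice feeds plot_explanations()):

```r
explanation_multi <- test_tbl %>%
    slice(1:20) %>%              # 20 employees
    select(-Attrition) %>%
    lime::explain(explainer, n_labels = 1, n_features = 8,
                  n_permutations = 5000, kernel_width = 0.5)

# Heatmap of cases x features - readable at higher case counts
plot_explanations(explanation_multi)
```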

  • The No panel indicates the features that support or contradict attrition. Even when there is support for attrition, other factors balance the employee's experience.
  • Cases 5, 11 and 20 are likely to leave since they are in the Yes panel.
  • OverTime = Yes is dark red in the No column, contradicting a No response.
  • With more than about 20 cases, the plot becomes challenging to decipher. The table view might be easier in those cases. You can always dplyr::summarize on the case number to group individual employees.

10.1.5 Recreating plot_features() & plot_explanations()

Extend the Lime plots with theme_tq().

Single case:

10.1.6 Aside: Glue

Expressions enclosed by braces will be evaluated as R code. Long strings are broken by line and concatenated together. Leading white space and blank lines from the first and last lines are automatically trimmed. Example:
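A small illustrative example (the variable name is arbitrary):

```r
library(glue)

n_cases <- 20
glue("
    Explaining {n_cases} employees.
    Braced expressions are evaluated as R code,
    and leading whitespace is trimmed automatically.
    ")
```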

11 Calculating Expected ROI Of Policy Change

This teaches you how to add value to organizations - by providing ROI-Driven Data Science! This is split in two:

  1. Covers calculating expected savings of a policy change (really multiple policy changes)
    • learn how to calculate a simple policy change - Implementing a No Overtime Policy. You will calculate the expected savings comparing to the current policy of allowing overtime. Next, you will develop a more sophisticated policy change - Implementing a Targeted Overtime Policy - wherein you target high-flight risk employees leaving the lower risk employees to work overtime as they please. We’ll teach you everything you need to know about the Expected Value Framework so you can begin to implement this for ANY BINARY CLASSIFICATION MODEL. This includes Customer Churn, Targeted Advertisements (think mailers), and more!
  2. Take targeted overtime policy and expands it to cover threshold optimization and sensitivity analysis, two of the most important topics in Data Science for Business!
    • Perform two advanced analyses that are the most critical:
      • Threshold Optimization - A method used to maximize expected saving via iteratively calculating savings at various thresholds
      • Sensitivity Analysis - A method used to investigate how sensitive the expected savings is to various parameter values that were created based on assumptions

11.1 BSPF Update

Starting on Step 2 of the BSPF Encode Algorithms Step:

This marks the beginning of the CRISP-DM Phase 5: Evaluation.

11.2 Expected Value Framework

Using ROI, two solutions are developed:

  1. A Simple Model where OT is reduced for everyone
  2. A more mature model where OT is reduced only for targeted high-risk employees, leaving lower-risk employees free to work overtime.

Threshold Optimization and Sensitivity Analysis will be developed for optimization.

  • Threshold Optimization is an iterative approach to expected value. It is critical to maximize profitability (or savings) of a change.
    • Expected Value is a way of assigning a value to a decision using both the probability of its occurrence and the potential benefit of the outcome.
    • Using expected value transforms decisions to be based on probability rather than intuition

  • Sensitivity Analysis adjusts parameters to illustrate their impact on the model. It is an iterative approach to assess ROI variability.

There are two key terms to understand that support the forthcoming code development:

  1. Attrition Cost: The cost resulting primarily from lost productivity, quantified as a function of salary, benefits, etc.
  2. Policy Change Cost: The cost associated with the reduction of overtime (it could be another correlated feature), given as a percentage of Attrition Cost

All costs are annualized

11.3 Expected ROI: No Over Time Policy

Calculate the expected cost of changing all employees to no overtime.

3 Steps:

  1. Calculate the expected value with overtime as it is now (baseline).
  2. Calculate the expected value with no overtime.
  3. Calculate the difference and make a decision.

11.3.1 Expected Value Baseline w OT: Step 1

  • Recall the assess_attrition script contains several functions:
    • assess_attrition
    • calculate_attrition_cost - This is the one that is needed.
    • count_to_pct
    • plot_attrition
  • bind_cols adds columns from test_tbl to the three columns provided by automl_leader %>% h2o.predict(newdata = as.h2o(test_tbl)) %>% as.tibble()
    • Employee Number useful to identify and track employees
    • Monthly Income is used later to estimate the cost of an OT policy change
    • Overtime is needed to determine whether or not the employee is working OT

## # A tibble: 6 x 6
##   predict    No    Yes EmployeeNumber MonthlyIncome OverTime
##   <fct>   <dbl>  <dbl>          <dbl>         <dbl> <fct>   
## 1 No      0.964 0.0359            228          9419 No      
## 2 No      0.891 0.109            1278         13577 Yes     
## 3 No      0.967 0.0330           1250         17779 No      
## 4 No      0.953 0.0471           2065          5390 No      
## 5 Yes     0.210 0.790            1767          2437 Yes     
## 6 No      0.937 0.0631           1308          2372 No

Recall calculate_attrition_cost:

## function (n = 1, salary = 80000, separation_cost = 500, vacancy_cost = 10000, 
##     acquisition_cost = 4900, placement_cost = 3500, net_revenue_per_employee = 250000, 
##     workdays_per_year = 240, workdays_position_open = 40, workdays_onboarding = 60, 
##     onboarding_efficiency = 0.5) 
## {
##     direct_cost <- sum(separation_cost, vacancy_cost, acquisition_cost, 
##         placement_cost)
##     productivy_cost <- net_revenue_per_employee/workdays_per_year * 
##         (workdays_position_open + workdays_onboarding * onboarding_efficiency)
##     salary_benefit_reduction <- salary/workdays_per_year * workdays_position_open
##     cost_per_employee <- direct_cost + productivy_cost - salary_benefit_reduction
##     total_cost <- n * cost_per_employee
##     return(total_cost)
## }

Some changes we can make:

  • annualize the monthly income to calculate the real salary per employee
  • capturing net_revenue_per_employee so it can be manipulated
  • cost_of_policy_change = 0 because there is no change in the OT policy - keeping everything the same for the baseline
  • There is only attrition cost if the employee leaves. The Yes column is the likelihood of leaving (when combined can get an expected attrition cost - coming soon)
  • The simplified equation for expected cost:
    • \(ExpectedCost = Yes \times (AttritionCost + CostOfPolicyChange) + No \times CostOfPolicyChange\)
    • Remember Yes and No are probabilities
    • Because cost_of_policy_change is 0 here, \(ExpectedCost = Yes \times AttritionCost\)
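A sketch of the baseline pipeline. The predictions tibble name is an assumption; it holds the h2o.predict() output bound to EmployeeNumber, MonthlyIncome and OverTime:

```r
ev_with_OT_tbl <- predictions_with_OT_tbl %>%   # assumed name
    mutate(
        attrition_cost = calculate_attrition_cost(
            n      = 1,
            salary = MonthlyIncome * 12      # annualize monthly income
        ),
        cost_of_policy_change = 0,           # baseline: no policy change
        expected_attrition_cost =
            Yes * (attrition_cost + cost_of_policy_change) +
            No  * cost_of_policy_change
    )
```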
## # A tibble: 220 x 9
##    predict    No    Yes EmployeeNumber MonthlyIncome OverTime
##    <fct>   <dbl>  <dbl>          <dbl>         <dbl> <fct>   
##  1 No      0.964 0.0359            228          9419 No      
##  2 No      0.891 0.109            1278         13577 Yes     
##  3 No      0.967 0.0330           1250         17779 No      
##  4 No      0.953 0.0471           2065          5390 No      
##  5 Yes     0.210 0.790            1767          2437 Yes     
##  6 No      0.937 0.0631           1308          2372 No      
##  7 No      0.956 0.0444             18          2661 No      
##  8 No      0.960 0.0403            460          6347 No      
##  9 No      0.963 0.0370           1369          5363 No      
## 10 No      0.959 0.0414           1040          6347 No      
## # ... with 210 more rows, and 3 more variables: attrition_cost <dbl>,
## #   cost_of_policy_change <dbl>, expected_attrition_cost <dbl>

Now calculate the total attrition cost with OT - the baseline case. Simple to do with summarize.
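Roughly (assuming `ev_with_OT_tbl` holds the per-employee expected costs):

```r
ev_with_OT_tbl %>%
    summarize(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
```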

## # A tibble: 1 x 1
##   total_expected_attrition_cost_0
##                             <dbl>
## 1                        3078803.

11.3.2 Expected Value No OT: Step 2

  • Modify test_tbl to change all OT Yes to No. fct_recode makes this easy!
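A sketch of the recode (see the fct_recode snippet in the TOC):

```r
library(forcats)

# Recode every OverTime "Yes" to "No" - the only remaining level is "No"
test_without_OT_tbl <- test_tbl %>%
    mutate(OverTime = fct_recode(OverTime, "No" = "Yes"))

test_without_OT_tbl %>% pull(OverTime)
```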
##   [1] No No No No No No No No No No No No No No No No No No No No No No No
##  [24] No No No No No No No No No No No No No No No No No No No No No No No
##  [47] No No No No No No No No No No No No No No No No No No No No No No No
##  [70] No No No No No No No No No No No No No No No No No No No No No No No
##  [93] No No No No No No No No No No No No No No No No No No No No No No No
## [116] No No No No No No No No No No No No No No No No No No No No No No No
## [139] No No No No No No No No No No No No No No No No No No No No No No No
## [162] No No No No No No No No No No No No No No No No No No No No No No No
## [185] No No No No No No No No No No No No No No No No No No No No No No No
## [208] No No No No No No No No No No No No No
## Levels: No
  • Use test_without_OT_tbl to make new predictions using the same code but with the modified data.
  • Rename OverTime so the before/after OT values sit side-by-side (OverTime_0 = original, OverTime_1 = modified)
## # A tibble: 6 x 7
##   predict    No    Yes EmployeeNumber MonthlyIncome OverTime_0 OverTime_1
##   <fct>   <dbl>  <dbl>          <dbl>         <dbl> <fct>      <fct>     
## 1 No      0.964 0.0359            228          9419 No         No        
## 2 No      0.953 0.0465           1278         13577 Yes        No        
## 3 No      0.967 0.0330           1250         17779 No         No        
## 4 No      0.953 0.0471           2065          5390 No         No        
## 5 No      0.793 0.207            1767          2437 Yes        No        
## 6 No      0.937 0.0631           1308          2372 No         No

Note all the values for OverTime_1 are No. The new overtime values in OverTime_1 will have an impact on the expected value calculation.

11.3.3 Calculate expected value NO OT

  • The cost of a policy change will NOT be 0 this time. When OT changes from Yes to No, there is a cost associated with the change:
    • cost_of_policy_change: OverTime_0 == "Yes" & OverTime_1 == "No" ~ avg_overtime_pct * attrition_cost
    • This cost arises primarily from the lost productivity of not working and producing during OT
      • Add new variable avg_overtime_pct. This is a business decision. The goal is to update the calculation to reflect a new policy based on realistic parameters the organization supports.
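A sketch of the no-OT expected-cost pipeline. The predictions tibble name and the avg_overtime_pct value are assumptions:

```r
avg_overtime_pct <- 0.10   # assumed value - set from actual business data

ev_without_OT_tbl <- predictions_without_OT_tbl %>%   # assumed name
    mutate(
        attrition_cost = calculate_attrition_cost(
            n = 1, salary = MonthlyIncome * 12
        ),
        # Only employees whose OT was toggled Yes -> No incur the cost
        cost_of_policy_change = case_when(
            OverTime_0 == "Yes" & OverTime_1 == "No" ~ avg_overtime_pct * attrition_cost,
            TRUE                                     ~ 0
        ),
        expected_attrition_cost =
            Yes * (attrition_cost + cost_of_policy_change) +
            No  * cost_of_policy_change
    )
```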
## # A tibble: 220 x 6
##    MonthlyIncome OverTime_0 OverTime_1 attrition_cost cost_of_policy_~
##            <dbl> <fct>      <fct>               <dbl>            <dbl>
##  1          9419 No         No                 72979.               0 
##  2         13577 Yes        No                 64663.            6466.
##  3         17779 No         No                 56259.               0 
##  4          5390 No         No                 81037.               0 
##  5          2437 Yes        No                 86943.            8694.
##  6          2372 No         No                 87073.               0 
##  7          2661 No         No                 86495.               0 
##  8          6347 No         No                 79123.               0 
##  9          5363 No         No                 81091.               0 
## 10          6347 No         No                 79123.               0 
## # ... with 210 more rows, and 1 more variable:
## #   expected_attrition_cost <dbl>

11.3.4 Calculate total attrition cost

## # A tibble: 1 x 1
##   total_expected_attrition_cost_1
##                             <dbl>
## 1                        2666825.

Recall the total attrition cost for the baseline (with OT) was 3,078,803 - the cost without OT is lower!

11.3.5 Calculate the Difference & Make Decision

## # A tibble: 1 x 4
##   total_expected_attrition_~ total_expected_attrition~ savings pct_savings
##                        <dbl>                     <dbl>   <dbl>       <dbl>
## 1                   3078803.                  2666825. 411978.       0.134

By eliminating OT, there is a potential to save over $400k. Recall we are examining only the 220 test records. To extrapolate to all employees, add the 1250 records from train:

\((train + test)/test = (1250 + 220) / 220 = 6.7; 6.7 * 400000 = 2,680,000\)

11.4 Targeting OT By Threshold

The previous analysis of the baseline compared to a policy of no OT for everyone. If the OT policy was targeted towards those that are likely to leave, then perhaps greater savings might be realized.

11.4.1 Threshold Primer: Confusion Matrix

  • The rows are actuals No and Yes; the columns are predicted No and Yes
  • The False Negatives (Yes row / No column) are the most expensive errors - predicted to STAY but actually LEAVES
  • False Positives are not so damaging - predict someone will leave but actually stays - no harm done
  • The F1 threshold in the output determines the mix in the confusion matrix
  • F1 is the optimal balance between Precision and Recall. However, the threshold @ max F1 is not typically the optimal value for the business case because FNs are typically more costly than FPs.
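A sketch of pulling the confusion matrix from the H2O performance object (assuming the leader model and the test set):

```r
performance_h2o <- h2o.performance(automl_leader, newdata = as.h2o(test_tbl))

# Defaults to the max-F1 threshold
h2o.confusionMatrix(performance_h2o)
```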
## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.485216626479869:
##         No Yes    Error     Rate
## No     181   3 0.016304   =3/184
## Yes     16  20 0.444444   =16/36
## Totals 197  23 0.086364  =19/220

The confusion matrix above is based on the threshold that maximizes F1.

What needs to be evaluated is how the confusion matrix changes over different threshold values. We typically seek to reduce false negatives (because they are often more expensive).

11.4.2 Threshold Primer: Expected Rates

  • h2o.metric returns the classifier performance metrics by threshold.
## Observations: 220
## Variables: 20
## $ threshold               <dbl> 0.966, 0.899, 0.899, 0.844, 0.839, 0.8...
## $ f1                      <dbl> 0.0541, 0.1053, 0.1538, 0.2000, 0.2439...
## $ f2                      <dbl> 0.0345, 0.0685, 0.1020, 0.1351, 0.1678...
## $ f0point5                <dbl> 0.125, 0.227, 0.312, 0.385, 0.446, 0.5...
## $ accuracy                <dbl> 0.841, 0.845, 0.850, 0.855, 0.859, 0.8...
## $ precision               <dbl> 1.000, 1.000, 1.000, 1.000, 1.000, 1.0...
## $ recall                  <dbl> 0.0278, 0.0556, 0.0833, 0.1111, 0.1389...
## $ specificity             <dbl> 1.000, 1.000, 1.000, 1.000, 1.000, 1.0...
## $ absolute_mcc            <dbl> 0.153, 0.217, 0.266, 0.308, 0.345, 0.3...
## $ min_per_class_accuracy  <dbl> 0.0278, 0.0556, 0.0833, 0.1111, 0.1389...
## $ mean_per_class_accuracy <dbl> 0.514, 0.528, 0.542, 0.556, 0.569, 0.5...
## $ tns                     <dbl> 184, 184, 184, 184, 184, 184, 184, 184...
## $ fns                     <dbl> 35, 34, 33, 32, 31, 30, 29, 28, 27, 27...
## $ fps                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ tps                     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 11, ...
## $ tnr                     <dbl> 1.000, 1.000, 1.000, 1.000, 1.000, 1.0...
## $ fnr                     <dbl> 0.972, 0.944, 0.917, 0.889, 0.861, 0.8...
## $ fpr                     <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0....
## $ tpr                     <dbl> 0.0278, 0.0556, 0.0833, 0.1111, 0.1389...
## $ idx                     <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ...

Note how the threshold changes other key values.

Only a subset of this data is needed for further analysis.

## # A tibble: 220 x 6
##    threshold     f1   tnr   fnr     fpr    tpr
##        <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl>
##  1     0.966 0.0541 1     0.972 0       0.0278
##  2     0.899 0.105  1     0.944 0       0.0556
##  3     0.899 0.154  1     0.917 0       0.0833
##  4     0.844 0.20   1     0.889 0       0.111 
##  5     0.839 0.244  1     0.861 0       0.139 
##  6     0.816 0.286  1     0.833 0       0.167 
##  7     0.815 0.326  1     0.806 0       0.194 
##  8     0.812 0.364  1     0.778 0       0.222 
##  9     0.790 0.4    1     0.75  0       0.25  
## 10     0.773 0.391  0.995 0.75  0.00543 0.25  
## # ... with 210 more rows

It is easy to get the values for the max F1.
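Roughly, filter on the maximum f1 (rates_by_threshold_tbl being the subset shown above):

```r
rates_by_threshold_tbl %>%
    select(threshold, f1, tnr:tpr) %>%
    filter(f1 == max(f1)) %>%
    slice(1)   # guard against ties: keep a single row
```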

## # A tibble: 1 x 6
##   threshold    f1   tnr   fnr    fpr   tpr
##       <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     0.485 0.678 0.984 0.444 0.0163 0.556

(slice makes sure only 1 F1 record is returned - there can be more than one.)

11.4.3 Threshold Primer: Visualizing Rates

Visualize how the critical confusion matrix values vary by threshold

  • Use gather to go from wide to long format - required for ggplot groups
  • factor_key = TRUE keeps the order of tnr:tpr
    • factor_key: If FALSE, the default, the key values will be stored as a character vector. If TRUE, will be stored as a factor, which preserves the original ordering of the columns.
  • fct_reorder2(key, threshold, value) used when plotting using x and y axis features to control the legend
    • Note the order of the legend matches the order of the values on the right side of the plot
    • the x axis uses Threshold and the y axis uses Value; key controls the legend
    • the key and the threshold have been reordered by key
  • \(tnr + fpr = 1\) and \(fnr + tpr = 1\)
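The bullets above translate into roughly this plot code (theme and labels are choices, not requirements):

```r
rates_by_threshold_tbl %>%
    select(threshold, tnr:tpr) %>%
    gather(key = "key", value = "value", tnr:tpr, factor_key = TRUE) %>%
    mutate(key = fct_reorder2(key, threshold, value)) %>%   # legend matches line order
    ggplot(aes(threshold, value, color = key)) +
    geom_point() +
    geom_smooth() +
    theme_tq() +
    labs(title = "Expected Rates", x = "Threshold", y = "Value")
```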

Note at Threshold = 1, the order of the lines from top to bottom match the order of the legend!

  • Group 1: TNR and FPR
    • Probability of correctly identifying a Negative
    • \(TNR + FPR = 1\)
  • Group 2: TPR and FNR
    • Probability of correctly identifying a positive
    • \(TPR + FNR = 1\)

FPs and FNs result in costs

11.5 Targeted Over Time Policy - Expected ROI

11.5.2 Expected Cost Of Targeted OT

Review a previous dataframe, rates_by_threshold_tbl:

## # A tibble: 220 x 20
##    threshold     f1     f2 f0point5 accuracy precision recall specificity
##        <dbl>  <dbl>  <dbl>    <dbl>    <dbl>     <dbl>  <dbl>       <dbl>
##  1     0.966 0.0541 0.0345    0.125    0.841       1   0.0278       1    
##  2     0.899 0.105  0.0685    0.227    0.845       1   0.0556       1    
##  3     0.899 0.154  0.102     0.312    0.85        1   0.0833       1    
##  4     0.844 0.20   0.135     0.385    0.855       1   0.111        1    
##  5     0.839 0.244  0.168     0.446    0.859       1   0.139        1    
##  6     0.816 0.286  0.200     0.5      0.864       1   0.167        1    
##  7     0.815 0.326  0.232     0.547    0.868       1   0.194        1    
##  8     0.812 0.364  0.263     0.588    0.873       1   0.222        1    
##  9     0.790 0.4    0.294     0.625    0.877       1   0.25         1    
## 10     0.773 0.391  0.292     0.592    0.873       0.9 0.25         0.995
## # ... with 210 more rows, and 12 more variables: absolute_mcc <dbl>,
## #   min_per_class_accuracy <dbl>, mean_per_class_accuracy <dbl>,
## #   tns <dbl>, fns <dbl>, fps <dbl>, tps <dbl>, tnr <dbl>, fnr <dbl>,
## #   fpr <dbl>, tpr <dbl>, idx <int>

Get the metrics for the max F1

## # A tibble: 1 x 6
##   threshold    f1   tnr   fnr    fpr   tpr
##       <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     0.485 0.678 0.984 0.444 0.0163 0.556

Save each of these values to its own variable.

The threshold at max F1 (0.485) balances precision and recall (F1 is their harmonic mean), trading off the FPR against the FNR:

Focus on records at or above the max-F1 threshold (0.485) and change their OT to No. To do this, modify test_tbl: for anyone with \(Yes >= threshold\) and \(OverTime = Yes\), toggle \(OverTime = No\). We are selectively removing overtime for the high-risk employees only.

  • add_column adds column using a vector of values. From tibble
  • Yes >= threshold ~ factor("No", levels = levels(test_tbl$OverTime))
    • need to make new variable into a factor
    • Use pre-existing levels from test_tbl
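A sketch of the targeted toggle. The `max_f1_tbl` and predictions tibble names are assumptions; the case_when branch mirrors the bullet above:

```r
library(tibble)

test_targeted_OT_tbl <- test_tbl %>%
    add_column(Yes = predictions_with_OT_tbl$Yes) %>%   # class probabilities
    mutate(
        OverTime = case_when(
            # high-risk employees lose overtime; factor keeps existing levels
            Yes >= max_f1_tbl$threshold ~ factor("No", levels = levels(test_tbl$OverTime)),
            TRUE                        ~ OverTime
        )
    ) %>%
    select(-Yes)
```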
## # A tibble: 220 x 32
##      Age Attrition BusinessTravel DailyRate Department DistanceFromHome
##    <dbl> <fct>     <fct>              <dbl> <fct>                 <dbl>
##  1    30 No        Travel_Rarely       1339 Sales                     5
##  2    55 No        Non-Travel           177 Research ~                8
##  3    54 No        Travel_Rarely        685 Research ~                3
##  4    49 No        Travel_Freque~      1023 Sales                     2
##  5    43 Yes       Travel_Freque~       807 Research ~               17
##  6    58 No        Travel_Rarely        848 Research ~               23
##  7    34 No        Travel_Rarely       1346 Research ~               19
##  8    37 No        Travel_Rarely       1192 Research ~                5
##  9    35 No        Travel_Rarely        817 Research ~                1
## 10    50 No        Non-Travel           145 Sales                     1
## # ... with 210 more rows, and 26 more variables: Education <fct>,
## #   EducationField <fct>, EmployeeNumber <dbl>,
## #   EnvironmentSatisfaction <fct>, Gender <fct>, HourlyRate <dbl>,
## #   JobInvolvement <fct>, JobLevel <fct>, JobRole <fct>,
## #   JobSatisfaction <fct>, MaritalStatus <fct>, MonthlyIncome <dbl>,
## #   MonthlyRate <dbl>, NumCompaniesWorked <dbl>, OverTime <fct>,
## #   PercentSalaryHike <dbl>, PerformanceRating <fct>,
## #   RelationshipSatisfaction <fct>, StockOptionLevel <fct>,
## #   TotalWorkingYears <dbl>, TrainingTimesLastYear <dbl>,
## #   WorkLifeBalance <fct>, YearsAtCompany <dbl>, YearsInCurrentRole <dbl>,
## #   YearsSinceLastPromotion <dbl>, YearsWithCurrManager <dbl>

Now that the data is prepared and we have test_targeted_OT_tbl, generate predictions.

Running h2o.predict on the new data outputs the class probabilities of Attrition = Yes after implementing the policy change.

## # A tibble: 220 x 7
##    predict    No    Yes EmployeeNumber MonthlyIncome OverTime_0 OverTime_1
##    <fct>   <dbl>  <dbl>          <dbl>         <dbl> <fct>      <fct>     
##  1 No      0.964 0.0359            228          9419 No         No        
##  2 No      0.891 0.109            1278         13577 Yes        Yes       
##  3 No      0.967 0.0330           1250         17779 No         No        
##  4 No      0.953 0.0471           2065          5390 No         No        
##  5 No      0.793 0.207            1767          2437 Yes        No        
##  6 No      0.937 0.0631           1308          2372 No         No        
##  7 No      0.956 0.0444             18          2661 No         No        
##  8 No      0.960 0.0403            460          6347 No         No        
##  9 No      0.963 0.0370           1369          5363 No         No        
## 10 No      0.959 0.0414           1040          6347 No         No        
## # ... with 210 more rows

Before:

##   predict    No    Yes
## 1      No 0.964 0.0359
## 2      No 0.891 0.1090
## 3      No 0.967 0.0330
## 4      No 0.953 0.0471
## 5     Yes 0.210 0.7903
## 6      No 0.937 0.0631
## 
## [220 rows x 3 columns]

After:

##   predict    No    Yes
## 1      No 0.964 0.0359
## 2      No 0.891 0.1090
## 3      No 0.967 0.0330
## 4      No 0.953 0.0471
## 5      No 0.793 0.2070
## 6      No 0.937 0.0631
## 
## [220 rows x 3 columns]

Record #5's Yes probability changed from 0.790 to 0.207, and its prediction changed from Yes to No.

Predictions have been made and saved in predictions_targeted_OT_tbl. The next step is to calculate the costs of the new targeted OT policy. Some of the code is familiar and copied from previous work.

  • calculate attrition cost
  • cost of policy change
  • Calculating the cost/benefit (cb) values is the biggest change when working with thresholds.
    • The cost/benefit values are local, per-employee values that account for the True Positive, True Negative, False Positive and False Negative outcomes.
      • cb_tn = cost of policy change. The cost when a targeted employee stays; governed by the cost-of-policy-change logic because they are being targeted.
      • cb_fp = cost of policy change. The cost when an employee is predicted to leave but stays; also governed by the cost-of-policy-change logic because they were targeted.
      • cb_tp = cost of policy change + attrition cost. Governed by the cost-of-policy-change logic AND attrition costs.
      • cb_fn = cost of policy change + attrition cost. The cost when an employee leaves even though they were predicted to stay. This is the most expensive case, governed by the cost-of-policy-change logic AND attrition costs.
  • expected_attrition_cost calculates the overall expected cost by weighting the Yes and No probabilities by the cb_** costs calculated above.

Memorize/understand that TNR and FPR go together - they sum to 1 - and that TPR and FNR go together because they too sum to 1.

It might be helpful to review the Expected Rates plot above. It makes intuitive sense if you think about it.
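To make the cb_** logic concrete, here is a minimal worked example with hypothetical numbers (the probabilities, rates and costs below are illustrative only, not taken from the dataset):

```r
# Hypothetical targeted employee (all values illustrative only)
yes <- 0.20; no <- 0.80                          # predicted class probabilities
attrition_cost        <- 78000
cost_of_policy_change <- 7800                    # targeted, so policy cost applies

cb_tn <- cost_of_policy_change                   # predicted stay, stays
cb_fp <- cost_of_policy_change                   # predicted leave, stays
cb_tp <- attrition_cost + cost_of_policy_change  # predicted leave, leaves
cb_fn <- attrition_cost + cost_of_policy_change  # predicted stay, leaves

# Expected rates at the chosen threshold: tpr + fnr = 1 and tnr + fpr = 1
tpr <- 0.75; fnr <- 0.25
tnr <- 0.89; fpr <- 0.11

expected_attrition_cost <-
    yes * (tpr * cb_tp + fnr * cb_fn) + no * (tnr * cb_tn + fpr * cb_fp)
expected_attrition_cost
#> [1] 23400
```

Because cb_tp equals cb_fn and cb_tn equals cb_fp here, and each rate pair sums to 1, the formula collapses to Yes * cb_tp + No * cb_tn for this record.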

## # A tibble: 220 x 14
##    predict    No    Yes EmployeeNumber MonthlyIncome OverTime_0 OverTime_1
##    <fct>   <dbl>  <dbl>          <dbl>         <dbl> <fct>      <fct>     
##  1 No      0.964 0.0359            228          9419 No         No        
##  2 No      0.891 0.109            1278         13577 Yes        Yes       
##  3 No      0.967 0.0330           1250         17779 No         No        
##  4 No      0.953 0.0471           2065          5390 No         No        
##  5 No      0.793 0.207            1767          2437 Yes        No        
##  6 No      0.937 0.0631           1308          2372 No         No        
##  7 No      0.956 0.0444             18          2661 No         No        
##  8 No      0.960 0.0403            460          6347 No         No        
##  9 No      0.963 0.0370           1369          5363 No         No        
## 10 No      0.959 0.0414           1040          6347 No         No        
## # ... with 210 more rows, and 7 more variables: attrition_cost <dbl>,
## #   cost_of_policy_change <dbl>, cb_tn <dbl>, cb_fp <dbl>, cb_tp <dbl>,
## #   cb_fn <dbl>, expected_attrition_cost <dbl>

Review the table above with a few examples:

  • 1st record: 4% Yes probability, no overtime and therefore nothing to adjust. Because the change of policy is not a factor, the TN and FP cost/benefits are 0 - they are not leaving. If the employee does leave (TP, FN), then these cost/benefit features equal the attrition cost. The expected attrition cost is low because the chance of leaving is low and the employee is not targeted.

  • The real logic is only leveraged when a record's Yes probability exceeds the threshold. If someone is working overtime and the threshold is not met, the overtime stays Yes. Look at record 2 and compare it to record 5. If record 2 were to leave, the full cost would be 64662.667 rather than the probability-weighted expected attrition cost stored in ev_targeted_OT_tbl[2, "expected_attrition_cost"].

The final step is to calculate the final expected value.

## # A tibble: 1 x 1
##   total_expected_attrition_cost_1
##                             <dbl>
## 1                        2683388.

11.5.3 Expected Savings - Targeted OT Policy

## # A tibble: 1 x 4
##   total_expected_attrition_~ total_expected_attrition~ savings pct_savings
##                        <dbl>                     <dbl>   <dbl>       <dbl>
## 1                   3078803.                  2683388. 395415.       0.128

Using the targeted OT policy saves ~$395k (12.8%). Recall the targeted OT policy is based on the max F1 threshold. The next question: does the max F1 threshold provide the greatest savings?

12 Threshold Optimization: Maximizing Expected ROI

Is there a better threshold that can provide more savings than the max F1 threshold? We should always check whether there is a more optimal threshold.

This requires 2 steps:

  1. Create a function that can be used on a single threshold
  2. Iteratively apply the function to optimize

12.1 Calculate Single Threshold

calculate_savings_by_threshold calculates and returns the savings when the user provides data, h2o model, the threshold and the expected rates (tnr, fpr, fnr, tpr).

The code below is familiar; this time the code is combined into a single function. All that is returned is a single value - the savings.

  • The default values set the threshold so low that all OT is eliminated - the same as the no-OT use case completed before.
    • bind_cols is a dplyr function that does what it says. In this case, the predict, No, and Yes columns returned by h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() are bound with three features (EmployeeNumber, MonthlyIncome, OverTime) from the original data in data_0_tbl and saved as pred_0_tbl.
    • ev_0_tbl takes pred_0_tbl with the predictions and adds some more features that have been covered before.
    • After the dplyr summary code, the expected value is saved in total_ev_0_tbl.
  • total_ev_0_tbl simply provides the number for the expected attrition cost.
  • data_1_tbl gets a new column Yes that is used to overwrite the OverTime feature.
    • This is the same code explored before.
    • Everyone is switched to No because the threshold is so low.
    • Predictions must be run again.
    • Again we get columns OverTime_0 and OverTime_1 to show who changed from Yes to No.
    • The rest is code we reviewed recently.
calculate_savings_by_threshold <- function(data, h2o_model, threshold = 0,
                                           tnr = 0, fpr = 1, fnr = 0, tpr = 1) {
    data_0_tbl <- as.tibble(data)
    
# Calculating Expected Value With OT 
    pred_0_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() %>%
        bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime))
    
    ev_0_tbl <- pred_0_tbl %>%
        mutate(attrition_cost = calculate_attrition_cost(n = 1,salary = MonthlyIncome * 12,
                                                         net_revenue_per_employee = 250000)) %>%
        mutate(cost_of_policy_change = 0) %>%
        mutate(expected_attrition_cost = 
                    Yes * (attrition_cost + cost_of_policy_change) + No *  (cost_of_policy_change))
    
    total_ev_0_tbl <- ev_0_tbl %>% summarise(
         total_expected_attrition_cost_0 = sum(expected_attrition_cost))
    
#Calculating Expected Value With Targeted OT
    data_1_tbl <- data_0_tbl %>% add_column(Yes = pred_0_tbl$Yes) %>%
        mutate( OverTime = case_when(
             Yes >= threshold ~ factor("No", levels = levels(data_0_tbl$OverTime)), 
             TRUE ~ OverTime)) %>% select(-Yes) 
    
    pred_1_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_1_tbl)) %>% as.tibble() %>%
        bind_cols(data_0_tbl %>%
                       select(EmployeeNumber, MonthlyIncome, OverTime), data_1_tbl %>%
        select(OverTime)) %>% rename(OverTime_0 = OverTime, OverTime_1 = OverTime1)
    
    avg_overtime_pct <- 0.10
    
    ev_1_tbl <- pred_1_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1,
                                                       salary = MonthlyIncome * 12,
                                                       net_revenue_per_employee = 250000)) %>%
        mutate(cost_of_policy_change = case_when(
             OverTime_1 == "No" & OverTime_0 == "Yes" ~ attrition_cost * avg_overtime_pct,
             TRUE ~ 0))%>%
        
         mutate(cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
            cb_fn = attrition_cost + cost_of_policy_change, 
            cb_tp = attrition_cost + cost_of_policy_change,
            expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp))
    
    total_ev_1_tbl <- ev_1_tbl %>%
        summarise(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
    
#Savings Calculation
    savings_tbl <- bind_cols(total_ev_0_tbl, total_ev_1_tbl) %>%
        mutate(
            savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
            pct_savings = savings / total_expected_attrition_cost_0)
    
    return(savings_tbl$savings)}

12.1.1 Test the new function

It is helpful to recall this plot to understand the values from the confusion matrix.

  • No OT Policy where \(tnr = 0 ; fnr = 0 ; threshold = 0\)
    • Because \(threshold = 0\), everyone gets transformed from Yes to No
## [1] 411978

So the expected savings for a no-OT policy is ~$412k.

  • Do Nothing Policy
    • Threshold is so high no one gets changed from Yes to No.
    • The values for the confusion matrix are reversed and the threshold is set to its max - a value no one can meet.

## [1] 0

A savings of $0 makes sense because a threshold of 1 changes nothing. If there are no changes, there are no savings.

  • Threshold @ Max F1
## [1] 395415
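The three test scenarios above follow the same calling pattern; here is a hedged sketch (test_tbl and automl_leader come from earlier in the document, so the actual calls are shown as comments). The runnable part only checks that each scenario's rate arguments are internally consistent:

```r
# No OT policy: threshold = 0, so every OverTime = "Yes" flips to "No"
no_ot_args      <- list(threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1)
# Do Nothing policy: rates reversed, threshold at the max so nobody is targeted
do_nothing_args <- list(threshold = 1, tnr = 1, fpr = 0, fnr = 1, tpr = 0)

# Each scenario would then be run as, e.g. (assumes objects from earlier):
# calculate_savings_by_threshold(test_tbl, automl_leader,
#     threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1)

# Sanity check: TNR/FPR and TPR/FNR always pair off to sum to 1
for (a in list(no_ot_args, do_nothing_args)) {
    stopifnot(a$tnr + a$fpr == 1, a$tpr + a$fnr == 1)
}
```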

12.2 Optimization with purrr

Time to iteratively discover the optimal threshold that yields the max savings. We will reuse parts of rates_by_threshold_tbl:

## # A tibble: 6 x 5
##   threshold   tnr   fnr   fpr    tpr
##       <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1     0.966     1 0.972     0 0.0278
## 2     0.899     1 0.944     0 0.0556
## 3     0.899     1 0.917     0 0.0833
## 4     0.844     1 0.889     0 0.111 
## 5     0.839     1 0.861     0 0.139 
## 6     0.816     1 0.833     0 0.167
  • seq is used to select a portion of the thresholds.
    • length.out = 20 picks 20 evenly spaced indices from 1 to 200, rounded to whole numbers with round(0).
  • pmap_dbl: when mutate is combined with a purrr map function, row-wise iteration is possible. {#secondMap}
    • With map alone, column-wise mapping is provided. {#firstMap}
    • pmap is one of the map functions that take a list input (.l), which allows any number of inputs to be mapped.
      • .l is required; it stores a list that maps the function arguments to columns in the data frame.
    • pmap_dbl returns a numeric (double) vector, as opposed to pmap, which returns a list by default.
    • The purrr::pmap functions are variants of map() that iterate over multiple arguments in parallel.
    • Below, threshold, tnr, fnr, fpr and tpr are all used to calculate savings.
    • Recall calculate_savings_by_threshold has the following arguments:
      • function(data, h2o_model, threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1)

  • .f is a required argument just like .l.
    • partial allows you to preload a function with arguments that do not change. {#partial}
    • In calculate_savings_by_threshold, the arguments data and h2o_model do not change - this is very useful for mapping when arguments like these never change during the iteration.
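The mutate + pmap_dbl + partial pattern described above can be sketched with a toy savings function standing in for calculate_savings_by_threshold (the real call uses test_tbl and automl_leader, which are omitted here so the sketch stays self-contained):

```r
library(purrr)
library(dplyr)
library(tibble)

# Toy stand-in with the same argument shape as calculate_savings_by_threshold
toy_savings <- function(data, h2o_model, threshold = 0,
                        tnr = 0, fpr = 1, fnr = 0, tpr = 1) {
    (tpr - fpr) * 1000 - threshold
}

# partial() preloads the arguments that never change during the iteration
toy_savings_preloaded <- partial(toy_savings, data = NULL, h2o_model = NULL)

rates_tbl <- tibble(
    threshold = c(0.9, 0.5, 0.1),
    tnr = c(1, 0.9, 0.2), fnr = c(0.9, 0.4, 0.1),
    fpr = c(0, 0.1, 0.8), tpr = c(0.1, 0.6, 0.9))

# .l maps columns to the function arguments; pmap_dbl returns a numeric vector
optimized_tbl <- rates_tbl %>%
    mutate(savings = pmap_dbl(
        .l = list(threshold = threshold, tnr = tnr, fnr = fnr,
                  fpr = fpr, tpr = tpr),
        .f = toy_savings_preloaded))
```

The real pipeline is the same shape: swap toy_savings_preloaded for a partial of calculate_savings_by_threshold preloaded with data and h2o_model.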
## # A tibble: 20 x 6
##    threshold    tnr    fnr     fpr    tpr savings
##        <dbl>  <dbl>  <dbl>   <dbl>  <dbl>   <dbl>
##  1    0.966  1      0.972  0       0.0278  11918.
##  2    0.737  0.995  0.667  0.00543 0.333  249517.
##  3    0.450  0.978  0.444  0.0217  0.556  417371.
##  4    0.315  0.929  0.361  0.0707  0.639  456590.
##  5    0.225  0.891  0.25   0.109   0.75   504173.
##  6    0.181  0.826  0.25   0.174   0.75   539867.
##  7    0.128  0.772  0.222  0.228   0.778  538986.
##  8    0.113  0.712  0.194  0.288   0.806  529153.
##  9    0.0930 0.652  0.194  0.348   0.806  517312.
## 10    0.0737 0.598  0.139  0.402   0.861  513145.
## 11    0.0648 0.543  0.111  0.457   0.889  490610.
## 12    0.0570 0.478  0.111  0.522   0.889  477937.
## 13    0.0513 0.424  0.0833 0.576   0.917  464770.
## 14    0.0471 0.359  0.0833 0.641   0.917  437500.
## 15    0.0431 0.310  0.0278 0.690   0.972  429991.
## 16    0.0403 0.25   0      0.75    1      417023.
## 17    0.0387 0.190  0      0.810   1      417023.
## 18    0.0370 0.125  0      0.875   1      417023.
## 19    0.0345 0.0652 0      0.935   1      411978.
## 20    0.0312 0      0      1       1      411978.

After using the purrr functions, data is available for plotting.{#expand_limits}{#scale_x_continuous}

Max F1 does not provide the maximum savings!

The myPctMax value was computed beforehand to ensure the plot title is accurate.

rates_by_threshold_optimized_tbl %>% ggplot(aes(threshold, savings)) +
    geom_line(color = palette_light()[[1]]) + geom_point(color = palette_light()[[1]]) +
    
    # Optimal Point
    geom_point(shape = 21, size = 5, color = palette_light()[[3]],
               data = rates_by_threshold_optimized_tbl %>% filter(savings == max(savings))) +
    geom_label(aes(label = scales::dollar(savings)), vjust = -1, color = palette_light()[[3]],
               data = rates_by_threshold_optimized_tbl %>% filter(savings == max(savings))) +
    
    # F1 Max
    geom_vline(xintercept = max_f1_tbl$threshold, color = palette_light()[[5]], size = 2) +
    annotate(geom = "label", label = scales::dollar(max_f1_savings),
             x = max_f1_tbl$threshold, y = max_f1_savings, vjust = -1, color = palette_light()[[1]]) + 
    
    # No OT Policy
    geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl %>%
                   filter(threshold == min(threshold))) + 
    geom_label(aes(label = scales::dollar(savings)), vjust = -1, color = palette_light()[[2]],
               data = rates_by_threshold_optimized_tbl %>% filter(threshold == min(threshold))) +
    
    # Do Nothing Policy
    geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl %>%
                   filter(threshold == max(threshold))) +
    geom_label(aes(label = scales::dollar(round(savings, 0))), vjust = -1, color = palette_light()[[2]],
               data = rates_by_threshold_optimized_tbl %>% filter(threshold == max(threshold))) +
    
    # Aesthetics
    theme_tq() + expand_limits(x = c(-.1, 1.1), y = 8e5) +
    scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, by = 0.2)) +
    scale_y_continuous(labels = scales::dollar) +
     labs(title = glue("Optimization Results: Expected Savings Maximized At {myPctMax}"), 
          x = "Threshold (%)", y = "Savings")

  • Case 1: No OT Policy
## # A tibble: 1 x 6
##   threshold   tnr   fnr   fpr   tpr savings
##       <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1    0.0312     0     0     1     1 411978.
  • Threshold = 0
    • Anyone with OT targeted (Yes turns into No)

This policy will reduce turnover, particularly with the higher OT employees.

Recall this is just for a sample of the population - the savings organizationally are much larger

  • Case 2: Optimization
## # A tibble: 1 x 6
##   threshold   tnr   fnr   fpr   tpr savings
##       <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1     0.181 0.826  0.25 0.174  0.75 539867.
  • Threshold at max savings
    • Target employees by a weighted analysis of the cost of FNs and the cost of FPs
    • The optimized value is at a lower threshold than the max F1. This happens because FNs are more costly than FPs.
      • Precision ~ FPs (predict employee leaves but stays)
      • Recall ~ FNs (predict employee to stay but leaves)
        • Failing to target means OT is not reduced and 100% of the attrition cost is incurred
        • If the cost of attrition is 100k and the cost to target is, worst case, 30% OT (30k), then there is a 70k difference
      • FPs are often a fraction of the cost of a FN

When FNs cost more than FPs, the optimal threshold will be lower than the max F1.

In this case, reduce the threshold from the max F1 (~22.5%) to ~18.1%.

  • Case 3: Max F1

The calculation will be close but not exact. Recall there are only 20 records in rates_by_threshold_optimized_tbl so the closest record is used for the table below.

## # A tibble: 1 x 6
##   threshold   tnr   fnr   fpr   tpr savings
##       <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1     0.225 0.891  0.25 0.109  0.75 504173.
  • Threshold = Max F1
    • Target employees balancing FNs and FPs
      • FN = FP
  • Case 4: Do Nothing - make no changes to the OT policy
## # A tibble: 1 x 6
##   threshold   tnr   fnr   fpr    tpr savings
##       <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1     0.966     1 0.972     0 0.0278  11918.
  • Threshold ~1
    • The threshold is so high no one is targeted (no one has a 100% chance of leaving)
    • The actual savings should be 0, but we sampled 20 thresholds, not the full set of threshold possibilities

12.3 Sensitivity Analysis

Identifying the threshold that maximizes savings is good, but imperfect. Assumptions have been made.

  • Assume net revenue per employee = 250k
  • Assume average OT = 10%

You will never have perfect information therefore assumptions must be made

Even though you must make assumptions, you can still try different values to test your assumptions. This is why you do sensitivity analysis.

Next step is to understand the impact these assumptions have on expected savings. Develop a profitability heatmap illustrating the range of values including best, worst and most likely cases and effect on expected savings.

Will create a function then use purrr to iterate through the function.

calculate_savings_by_threshold_2 is a variant of calculate_savings_by_threshold that adds avg_overtime_pct and net_revenue_per_employee as inputs. There are only a few changes:

  • avg_overtime_pct = 0.10 and net_revenue_per_employee = 250000 are added as arguments (the confusion-matrix rate defaults are unchanged)
  • net_revenue_per_employee = net_revenue_per_employee is passed through to calculate_attrition_cost
  • ~ attrition_cost * avg_overtime_pct is used for the cost of the policy change

### calculate_savings_by_thresh_2 {#calculate_savings_by_thresh_2}

calculate_savings_by_threshold_2 <- function(data, h2o_model, threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1,
                                             avg_overtime_pct = 0.10, net_revenue_per_employee = 250000) {
    data_0_tbl <- as.tibble(data)
  
    # 4.1 Calculating Expected Value With OT 
    pred_0_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() %>%
        bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime))
    
    ev_0_tbl <- pred_0_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
                # Changed in _2 ----
                net_revenue_per_employee = net_revenue_per_employee) ) %>%
        mutate(cost_of_policy_change = 0) %>%
        mutate(expected_attrition_cost = Yes * (attrition_cost + cost_of_policy_change) +
                    No *  (cost_of_policy_change))
    
    total_ev_0_tbl <- ev_0_tbl %>%
        summarise(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
    
    # 4.2 Calculating Expected Value With Targeted OT
    data_1_tbl <- data_0_tbl %>% add_column(Yes = pred_0_tbl$Yes) %>%
        mutate(OverTime = case_when(Yes >= threshold ~ factor("No", levels = levels(data_0_tbl$OverTime)),
                                    TRUE ~ OverTime)) %>% select(-Yes) 
    
    pred_1_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_1_tbl)) %>% as.tibble() %>%
        bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime), data_1_tbl %>% select(OverTime)) %>%
        rename(OverTime_0 = OverTime, OverTime_1 = OverTime1)
    
    ev_1_tbl <- pred_1_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
                                                       # Changed in _2 ----
                                                       net_revenue_per_employee = net_revenue_per_employee)) %>%
        mutate(cost_of_policy_change = case_when(OverTime_1 == "No" & OverTime_0 == "Yes" 
                                                 ~ attrition_cost * avg_overtime_pct, TRUE ~ 0))%>%
        mutate(cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
               cb_fn = attrition_cost + cost_of_policy_change, cb_tp = attrition_cost + cost_of_policy_change,
               expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp))
    
    total_ev_1_tbl <- ev_1_tbl %>%
        summarise(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
    
    # 4.3 Savings Calculation
    
    savings_tbl <- bind_cols(total_ev_0_tbl, total_ev_1_tbl) %>%
        mutate(savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
               pct_savings = savings / total_expected_attrition_cost_0)
    
    return(savings_tbl$savings)}

12.3.1 Sensitivity Analysis

Time to iterate through calculate_savings_by_threshold_2.

First, use the threshold percentage that optimizes savings for all the iterations. We developed this before. This will be used to set the values in the sensitivity analysis.

## # A tibble: 1 x 6
##   threshold   tnr   fnr   fpr   tpr savings
##       <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1     0.181 0.826  0.25 0.174  0.75 539867.
## [1] 539867

Because many of the values will be static, build a partial function using the code above as a template. This is so powerful and yet so simple. Gotta love partial.

Test the partial function created. This is really helpful - it simplifies our function so we only need to supply the parameters we are iterating on with the sensitivity analysis.

## [1] 539867

It works - the same value was returned. The goal is to understand how this value changes with revenue per employee and overtime.

  • Create a list of the different values to be evaluated in the sensitivity analysis using seq
    • overtime range: 0.05, 0.1, 0.15, 0.2, 0.25, 0.3
    • revenue per employee: 200000, 250000, 300000, 350000, 400000
      • on an income statement, \((grossRevenue - COGS) / NumOfEmployees = netRevenuePerEmployee\)
  • cross_df from purrr is really helpful. From a list of variables, it produces all combinations of the list. Super useful for grid search and sensitivity analysis. A sample is provided below. {#cross_df}
    • cross_df can work with as many variables as you want!
  • pmap_dbl is used again to map a function by rows - this time the columns are avg_overtime_pct and net_revenue_per_employee
    • returns a numeric value per row
    • map2 or map2_dbl would also work; however, they accept only 2 variable mappings, so pmap is more flexible
    • recall pmap takes 2 required arguments: .l and .f
      • the .l argument gets its values from the cross_df output that has all the combinations of values
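A minimal sketch of the cross_df step (cross_df is the purrr function the text describes; note that in newer purrr releases it has been deprecated in favour of tidyr::expand_grid, which produces an equivalent grid):

```r
library(purrr)

# All combinations of the two assumption ranges described above
grid_tbl <- list(
    avg_overtime_pct         = seq(0.05, 0.30, by = 0.05),
    net_revenue_per_employee = seq(200000, 400000, by = 50000)) %>%
    cross_df()

nrow(grid_tbl)   # 6 overtime values x 5 revenue values = 30 combinations
```

Each row of grid_tbl then feeds one call of the preloaded savings function via pmap_dbl.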

## # A tibble: 10 x 2
##    avg_overtime_pct net_revenue_per_employee
##               <dbl>                    <dbl>
##  1             0.05                   200000
##  2             0.05                   250000
##  3             0.1                    200000
##  4             0.1                    350000
##  5             0.1                    400000
##  6             0.2                    400000
##  7             0.25                   250000
##  8             0.3                    250000
##  9             0.3                    300000
## 10             0.3                    350000
## # A tibble: 6 x 3
##   avg_overtime_pct net_revenue_per_employee savings
##              <dbl>                    <dbl>   <dbl>
## 1             0.05                   200000 552741.
## 2             0.1                    200000 446819.
## 3             0.15                   200000 340896.
## 4             0.2                    200000 234974.
## 5             0.25                   200000 129051.
## 6             0.3                    200000  23128.

Now the data is available to plot!

12.4 Visualizing Scenarios

Heat maps are a good visualization to show how 2 variables interact with a 3rd variable (the target)

  • scale_fill_gradient2 uses 2 gradient transitions by specifying low, mid and high arguments along with a midpoint to specify the middle transition point
    • there are variations like scale_fill_gradient that offers only a low and high point.
    • scale_fill_gradient_n allows for any number of gradients.
  • scale_x_continuous - use the same breaks used earlier: seq(0.05, 0.30, by = 0.05)

If you want more color variation, change the midpoint value. Try midpoint = mean(sensitivity_tbl$savings):
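A hedged sketch of the heatmap itself, using a toy sensitivity_tbl (the savings formula below is invented purely to produce plausible-looking values; the real table comes from the pmap_dbl run above):

```r
library(ggplot2)

# Toy stand-in for sensitivity_tbl (values are illustrative only)
sensitivity_tbl <- expand.grid(
    avg_overtime_pct         = seq(0.05, 0.30, by = 0.05),
    net_revenue_per_employee = seq(200000, 400000, by = 50000))
sensitivity_tbl$savings <- with(sensitivity_tbl,
    550000 - 2e6 * avg_overtime_pct + 0.5 * (net_revenue_per_employee - 250000))

p <- ggplot(sensitivity_tbl,
            aes(avg_overtime_pct, net_revenue_per_employee, fill = savings)) +
    geom_tile() +
    # Two gradient transitions around a chosen midpoint
    scale_fill_gradient2(low = "red", mid = "white", high = "darkgreen",
                         midpoint = mean(sensitivity_tbl$savings)) +
    scale_x_continuous(labels = scales::percent,
                       breaks = seq(0.05, 0.30, by = 0.05)) +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = "Profitability Heatmap (illustrative data)",
         x = "Average Overtime %", y = "Net Revenue per Employee")
```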

Understanding the profitability heatmap.

  • Provides guidance for unknowns - we do not know the OT, we do not know revenue per employee. There are no perfect assumptions. These can be accounted for by ranges of values of best, worst and most likely cases.
  • The heatmap is a great way to show how sensitive ROI is to these 2 features.
  • In a different version of the analysis, a break-even point can be identified:

12.4.1 Optimization For Stock Options

Enhance the sensitivity analysis by adding stock options into the policy change simultaneously with OT reduction.

  1. Find the threshold
  2. Plot sensitivity

12.4.2 Find the threshold

Will assume these values:

  • avg_overtime_pct <- 0.10
  • net_revenue_per_employee <- 250000
  • stock_option_cost <- 5000
    • Targeted employees will have stock option level move from 0 to 1

Changes in the code below to support stock options:

  • data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel) to capture the stock info
  • mutate(StockOptionLevel = case_when(Yes >= threshold & StockOptionLevel == 0 ~ factor("1", levels = levels(data_0_tbl$StockOptionLevel)), TRUE ~ StockOptionLevel)) to capture when the stock option level changes from 0 to 1
  • bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel), data_1_tbl %>% select(OverTime, StockOptionLevel)) to include the stock option in the prediction table
  • StockOptionLevel_0 = StockOptionLevel, StockOptionLevel_1 = StockOptionLevel1 to rename the stock option columns to match the OverTime feature naming
  • mutate(cost_SO = case_when(StockOptionLevel_1 == "1" & StockOptionLevel_0 == "0" ~ stock_option_cost, TRUE ~ 0)) to capture the stock option cost when a change in options is found
data <- test_tbl
h2o_model <- automl_leader

threshold <- 0; tnr <- 0; fpr <- 1; fnr <- 0; tpr <- 1

avg_overtime_pct <- 0.10
net_revenue_per_employee <- 250000
stock_option_cost <- 5000

calculate_savings_by_threshold_3 <- function(data, h2o_model, threshold = 0, tnr = 0, fpr = 1, fnr = 0, tpr = 1,
                                   avg_overtime_pct = 0.10, net_revenue_per_employee = 250000, 
                                   stock_option_cost = 5000){
    data_0_tbl <- as.tibble(data)
    
    #Calculating Expected Value With OT 
    pred_0_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_0_tbl)) %>% as.tibble() %>%
        bind_cols(
            # Changed in _3 Select StockOptionLevel Col----
            data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel))
    
    ev_0_tbl <- pred_0_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
                # Changed in _2 ----
                net_revenue_per_employee = net_revenue_per_employee)) %>% mutate(cost_of_policy_change = 0) %>%
        mutate(expected_attrition_cost = Yes * (attrition_cost + cost_of_policy_change) + No *  (cost_of_policy_change))
    
    total_ev_0_tbl <- ev_0_tbl %>% summarise(total_expected_attrition_cost_0 = sum(expected_attrition_cost))
    
    #Calculating Expected Value With Targeted OT & Stock Option Policy
    data_1_tbl <- data_0_tbl %>% add_column(Yes = pred_0_tbl$Yes) %>%
        mutate(OverTime = case_when(Yes >= threshold ~ factor("No", levels = levels(data_0_tbl$OverTime)), 
                                    TRUE ~ OverTime)) %>%
        # Changed in _3 Create StockOption Var----
    mutate(StockOptionLevel = case_when(Yes >= threshold & StockOptionLevel == 0 
                              ~ factor("1", levels = levels(data_0_tbl$StockOptionLevel)), 
                              TRUE ~ StockOptionLevel)) %>%  select(-Yes) 
    
    pred_1_tbl <- h2o_model %>% h2o.predict(newdata = as.h2o(data_1_tbl)) %>% as.tibble() %>%
        # Changed in _3 Bind StockOption Col----
         bind_cols(data_0_tbl %>% select(EmployeeNumber, MonthlyIncome, OverTime, StockOptionLevel),
             data_1_tbl %>% select(OverTime, StockOptionLevel)) %>%
        rename(OverTime_0 = OverTime, OverTime_1 = OverTime1,
            # Changed in _3 Rename StockOptionLevel----
            StockOptionLevel_0 = StockOptionLevel, StockOptionLevel_1 = StockOptionLevel1)
    
    ev_1_tbl <- pred_1_tbl %>% mutate(attrition_cost = calculate_attrition_cost(n = 1, salary = MonthlyIncome * 12,
                # Changed in _2 ----
                net_revenue_per_employee = net_revenue_per_employee)) %>%
    # cost_OT
    mutate(cost_OT = case_when(OverTime_1 == "No" & OverTime_0 == "Yes" ~ avg_overtime_pct * MonthlyIncome * 12, TRUE ~ 0)) %>%
        # Changed in _3 Cost Stock----
        mutate(cost_SO = case_when(StockOptionLevel_1 == "1" & StockOptionLevel_0 == "0" ~ stock_option_cost, TRUE ~ 0)) %>%
        mutate(cost_of_policy_change = cost_OT + cost_SO) %>%
        mutate(cb_tn = cost_of_policy_change, cb_fp = cost_of_policy_change,
            cb_fn = attrition_cost + cost_of_policy_change, cb_tp = attrition_cost + cost_of_policy_change,
            expected_attrition_cost = Yes * (tpr*cb_tp + fnr*cb_fn) + No * (tnr*cb_tn + fpr*cb_fp)) 
    
    total_ev_1_tbl <- ev_1_tbl %>% summarise(total_expected_attrition_cost_1 = sum(expected_attrition_cost))
    
    #Savings Calculation
    savings_tbl <- bind_cols(total_ev_0_tbl, total_ev_1_tbl) %>%
        mutate(savings = total_expected_attrition_cost_0 - total_expected_attrition_cost_1,
               pct_savings = savings / total_expected_attrition_cost_0)
    
    return(savings_tbl$savings)}

Calculate the savings between doing nothing and max F1:

## [1] 809316
## # A tibble: 15 x 6
##    threshold   tnr    fnr     fpr    tpr  savings
##        <dbl> <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
##  1    0.966  1     0.972  0       0.0278   51241.
##  2    0.737  0.995 0.667  0.00543 0.333   517721.
##  3    0.450  0.978 0.444  0.0217  0.556   838874.
##  4    0.315  0.929 0.361  0.0707  0.639  1026488.
##  5    0.225  0.891 0.25   0.109   0.75   1097668.
##  6    0.181  0.826 0.25   0.174   0.75   1131973.
##  7    0.128  0.772 0.222  0.228   0.778  1125033.
##  8    0.113  0.712 0.194  0.288   0.806  1122143.
##  9    0.0930 0.652 0.194  0.348   0.806  1079308.
## 10    0.0737 0.598 0.139  0.402   0.861  1056028.
## 11    0.0648 0.543 0.111  0.457   0.889  1014063.
## 12    0.0570 0.478 0.111  0.522   0.889   991530.
## 13    0.0513 0.424 0.0833 0.576   0.917   963847.
## 14    0.0471 0.359 0.0833 0.641   0.917   905913.
## 15    0.0431 0.310 0.0278 0.690   0.972   892787.
## # A tibble: 1 x 6
##   threshold   tnr   fnr   fpr   tpr  savings
##       <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>
## 1     0.181 0.826  0.25 0.174  0.75 1131973.
rates_by_threshold_optimized_tbl_3 %>% ggplot(aes(threshold, savings)) +
    # Vlines
    geom_vline(xintercept = max_f1_tbl$threshold, color = palette_light()[[5]], size = 2) +
    geom_vline(aes(xintercept = threshold), color = palette_light()[[3]], size = 2, 
               data = rates_by_threshold_optimized_tbl_3 %>% filter(savings == max(savings))) +
    
    # Points
    geom_line(color = palette_light()[[1]]) + geom_point(color = palette_light()[[1]]) +
    
    # F1 Max
    annotate(geom = "label", label = scales::dollar(max_f1_savings),
             x = max_f1_tbl$threshold, y = max_f1_savings, vjust = -1, color = palette_light()[[1]]) + 
    
    # Optimal Point
    geom_point(shape = 21, size = 5, color = palette_light()[[3]], data = rates_by_threshold_optimized_tbl_3 %>%
                   filter(savings == max(savings))) +
    geom_label(aes(label = scales::dollar(savings)), vjust = -2, color = palette_light()[[3]], 
               data = rates_by_threshold_optimized_tbl_3 %>% filter(savings == max(savings))) +
    
    # No OT Policy
    geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl_3 %>%
                   filter(threshold == min(threshold))) +
    geom_label(aes(label = scales::dollar(savings)), vjust = -1, color = palette_light()[[2]], 
               data = rates_by_threshold_optimized_tbl_3 %>% filter(threshold == min(threshold))) +
    
    # Do Nothing Policy
    geom_point(shape = 21, size = 5, color = palette_light()[[2]], data = rates_by_threshold_optimized_tbl_3 %>%
                   filter(threshold == max(threshold))) +
    geom_label(aes(label = scales::dollar(round(savings, 0))), vjust = -1, color = palette_light()[[2]], 
               data = rates_by_threshold_optimized_tbl_3 %>% filter(threshold == max(threshold))) +
    
    # Aesthetics
    theme_tq() +  expand_limits(x = c(-.1, 1.1), y = 12e5) + 
    scale_x_continuous(labels = scales::percent, breaks = seq(0, 1, by = 0.2)) +
    scale_y_continuous(labels = scales::dollar) +
    labs(title = glue::glue("Optimization Results: Expected Savings Maximized at {max_f1_tbl$threshold * 100}"),
        x = "Threshold (%)", y = "Savings")

12.4.3 Plot Sensitivity

## # A tibble: 15 x 4
##    avg_overtime_pct net_revenue_per_employee stock_option_cost   savings
##               <dbl>                    <dbl>             <dbl>     <dbl>
##  1             0.05                   250000             25000   334708.
##  2             0.1                    250000              5000  1131973.
##  3             0.1                    250000             20000   471973.
##  4             0.1                    250000             25000   251973.
##  5             0.15                   250000              5000  1049239.
##  6             0.15                   250000             10000   829239.
##  7             0.15                   250000             20000   389239.
##  8             0.15                   250000             25000   169239.
##  9             0.2                    250000             15000   526504.
## 10             0.2                    250000             20000   306504.
## 11             0.25                   250000             15000   443769.
## 12             0.25                   250000             20000   223769.
## 13             0.3                    250000              5000   801035.
## 14             0.3                    250000             10000   581035.
## 15             0.3                    250000             25000   -78965.

The policy is applied when an employee’s likelihood to leave exceeds the threshold. Recall that Stock Option Cost is the value required to move an employee from Stock Option Level 0 (no stock options) to Level 1.

Breakeven is where the row and column transition from negative to positive.
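A sensitivity grid like the one shown above could be produced with purrr (a sketch; calculate_savings_by_thresh_2() and the inputs predictions_with_OT_tbl and h2o_leader are assumed from earlier sections):

```r
library(tidyverse)

# cross_df() builds all combinations of the candidate assumption values
sensitivity_tbl <- list(
    avg_overtime_pct         = seq(0.05, 0.30, by = 0.05),
    net_revenue_per_employee = 250000,
    stock_option_cost        = seq(5000, 25000, by = 5000)
) %>%
    cross_df() %>%
    # pmap_dbl() iterates row-wise, computing savings for each combination
    mutate(savings = pmap_dbl(
        list(avg_overtime_pct, net_revenue_per_employee, stock_option_cost),
        ~ calculate_savings_by_thresh_2(
              predictions_with_OT_tbl, h2o_leader,
              avg_overtime_pct         = ..1,
              net_revenue_per_employee = ..2,
              stock_option_cost        = ..3)
    ))
```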

13 Creating A Recommendation Algorithm

3 step process:

  1. Creating a “Correlation Funnel” - This is a discretized correlation plot that enables us to visualize cohort relationships to the target variable, attrition.
  2. Using A Recommendation Algorithm Worksheet To Develop Strategies - This is key to having a structured approach to developing data-driven recommendations. It’s also a communication tool for executives & process stakeholders to showcase the logic you used and to incorporate their input into the strategies.
  3. Implementing Strategies Into R Code - develop a function called recommend_strategies() that outputs strategies by employee.

The recommendation algorithm supports Step 3 under Encode Algorithms.

13.1 Recipes For Feature Discretization

https://tidymodels.github.io/recipes/articles/Ordering.html

Recall that recipes was used in Data Preparation when data_preparation_part_2_machine_readable.R was built. While your project’s needs may vary, here is a suggested order of potential steps that should work for most problems:

  1. Impute
  2. Individual transformations for skewness and other issues
  3. Discretize (if needed and if you have no other choice)
  4. Create dummy variables
  5. Create interactions
  6. Normalization steps (center, scale, range, etc)
  7. Multivariate transformation (e.g. PCA, spatial sign, etc)
  • Numeric features need to be binned to compare cohorts within the population. Discretization is the process of converting numeric data to factors via binning.
  • The one_hot option of recipes::step_dummy provides a column for every category when dummying variables, versus the default of one fewer column than the number of categories. This is very beneficial for correlation analysis interpretability. (It does give some warnings - OK in this case because there is not enough data for breaks.)

Recall what the training data looks like - it has been a while!

## Observations: 1,250
## Variables: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No,...
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Tra...
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department               <fct> Sales, Research & Development, Resear...
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <fct> College, Below College, College, Mast...
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, ...
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction  <fct> Medium, High, Very High, Very High, L...
## $ Gender                   <fct> Female, Male, Male, Female, Male, Mal...
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <fct> High, Medium, Medium, High, High, Hig...
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole                  <fct> Sales Executive, Research Scientist, ...
## $ JobSatisfaction          <fct> Very High, Medium, High, High, Medium...
## $ MaritalStatus            <fct> Single, Married, Single, Married, Mar...
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18                   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y...
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, N...
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <fct> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <fct> Low, Very High, Medium, High, Very Hi...
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <fct> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
  • JobLevel and StockOptionLevel are really factors

Build the recipe for the training data and glimpse the result.

  • recall that step_dummy(all_nominal()) dummies all categorical data
  • previously we did not use this option: step_discretize(all_numeric(), options = list(min_unique = 1))
  • Use bake() to apply the recipe to the data.
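The recipe implied by the four steps reported by tidy() further below could be sketched like this (exact arguments may vary by recipes version; object names are assumed from earlier sections):

```r
library(recipes)

recipe_obj <- recipe(Attrition ~ ., data = train_readable_tbl) %>%
    step_zv(all_predictors()) %>%                  # drop zero-variance features
    step_num2factor(JobLevel, StockOptionLevel) %>%  # JobLevel/StockOptionLevel are really factors
    step_discretize(all_numeric(),
                    options = list(min_unique = 1)) %>%  # bin numerics into cohorts
    step_dummy(all_nominal(), one_hot = TRUE) %>%  # one column per category
    prep()

train_corr_tbl <- bake(recipe_obj, new_data = train_readable_tbl)
```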
## Observations: 1,250
## Variables: 141
## $ Age_bin_missing                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Age_bin1                            <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ Age_bin2                            <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0,...
## $ Age_bin3                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1,...
## $ Age_bin4                            <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ BusinessTravel_Non.Travel           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BusinessTravel_Travel_Rarely        <dbl> 1, 0, 1, 0, 1, 0, 1, 1, 0,...
## $ BusinessTravel_Travel_Frequently    <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 1,...
## $ DailyRate_bin_missing               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ DailyRate_bin1                      <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 1,...
## $ DailyRate_bin2                      <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ DailyRate_bin3                      <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ DailyRate_bin4                      <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ Department_Human.Resources          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ Department_Research...Development   <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ Department_Sales                    <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ DistanceFromHome_bin_missing        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ DistanceFromHome_bin1               <dbl> 1, 0, 1, 0, 1, 1, 0, 0, 0,...
## $ DistanceFromHome_bin2               <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ DistanceFromHome_bin3               <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ DistanceFromHome_bin4               <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1,...
## $ Education_Below.College             <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0,...
## $ Education_College                   <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 0,...
## $ Education_Bachelor                  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1,...
## $ Education_Master                    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,...
## $ Education_Doctor                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Human.Resources      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Life.Sciences        <dbl> 1, 1, 0, 1, 0, 1, 0, 1, 1,...
## $ EducationField_Marketing            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Medical              <dbl> 0, 0, 0, 0, 1, 0, 1, 0, 0,...
## $ EducationField_Other                <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ EducationField_Technical.Degree     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin_missing          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin1                 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ EmployeeNumber_bin2                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin3                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EmployeeNumber_bin4                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EnvironmentSatisfaction_Low         <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ EnvironmentSatisfaction_Medium      <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ EnvironmentSatisfaction_High        <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 0,...
## $ EnvironmentSatisfaction_Very.High   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1,...
## $ Gender_Female                       <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 0,...
## $ Gender_Male                         <dbl> 0, 1, 1, 0, 1, 1, 0, 1, 1,...
## $ HourlyRate_bin_missing              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ HourlyRate_bin1                     <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1,...
## $ HourlyRate_bin2                     <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,...
## $ HourlyRate_bin3                     <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 0,...
## $ HourlyRate_bin4                     <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ JobInvolvement_Low                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobInvolvement_Medium               <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 1,...
## $ JobInvolvement_High                 <dbl> 1, 0, 0, 1, 1, 1, 0, 1, 0,...
## $ JobInvolvement_Very.High            <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ JobLevel_X1                         <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 0,...
## $ JobLevel_X2                         <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ JobLevel_X3                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ JobLevel_X4                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobLevel_X5                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Healthcare.Representative   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Human.Resources             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Laboratory.Technician       <dbl> 0, 0, 1, 0, 1, 1, 1, 1, 0,...
## $ JobRole_Manager                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Manufacturing.Director      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ JobRole_Research.Director           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Research.Scientist          <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,...
## $ JobRole_Sales.Executive             <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobRole_Sales.Representative        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ JobSatisfaction_Low                 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ JobSatisfaction_Medium              <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ JobSatisfaction_High                <dbl> 0, 0, 1, 1, 0, 0, 0, 1, 1,...
## $ JobSatisfaction_Very.High           <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ MaritalStatus_Single                <dbl> 1, 0, 1, 0, 0, 1, 0, 0, 1,...
## $ MaritalStatus_Married               <dbl> 0, 1, 0, 1, 1, 0, 1, 0, 0,...
## $ MaritalStatus_Divorced              <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,...
## $ MonthlyIncome_bin_missing           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MonthlyIncome_bin1                  <dbl> 0, 0, 1, 1, 0, 0, 1, 1, 0,...
## $ MonthlyIncome_bin2                  <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0,...
## $ MonthlyIncome_bin3                  <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ MonthlyIncome_bin4                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ MonthlyRate_bin_missing             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MonthlyRate_bin1                    <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ MonthlyRate_bin2                    <dbl> 0, 0, 0, 0, 0, 1, 1, 1, 1,...
## $ MonthlyRate_bin3                    <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 0,...
## $ MonthlyRate_bin4                    <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0,...
## $ NumCompaniesWorked_bin_missing      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NumCompaniesWorked_bin1             <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1,...
## $ NumCompaniesWorked_bin2             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ NumCompaniesWorked_bin3             <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ NumCompaniesWorked_bin4             <dbl> 1, 0, 1, 0, 1, 0, 0, 0, 0,...
## $ OverTime_No                         <dbl> 0, 1, 0, 0, 1, 1, 0, 1, 1,...
## $ OverTime_Yes                        <dbl> 1, 0, 1, 1, 0, 0, 1, 0, 0,...
## $ PercentSalaryHike_bin_missing       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PercentSalaryHike_bin1              <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0,...
## $ PercentSalaryHike_bin2              <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ PercentSalaryHike_bin3              <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,...
## $ PercentSalaryHike_bin4              <dbl> 0, 1, 0, 0, 0, 0, 1, 1, 1,...
## $ PerformanceRating_Low               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PerformanceRating_Good              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PerformanceRating_Excellent         <dbl> 1, 0, 1, 1, 1, 1, 0, 0, 0,...
## $ PerformanceRating_Outstanding       <dbl> 0, 1, 0, 0, 0, 0, 1, 1, 1,...
## $ RelationshipSatisfaction_Low        <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ RelationshipSatisfaction_Medium     <dbl> 0, 0, 1, 0, 0, 0, 0, 1, 1,...
## $ RelationshipSatisfaction_High       <dbl> 0, 0, 0, 1, 0, 1, 0, 0, 0,...
## $ RelationshipSatisfaction_Very.High  <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,...
## $ StockOptionLevel_X0                 <dbl> 1, 0, 1, 1, 0, 1, 0, 0, 1,...
## $ StockOptionLevel_X1                 <dbl> 0, 1, 0, 0, 1, 0, 0, 1, 0,...
## $ StockOptionLevel_X2                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ StockOptionLevel_X3                 <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ TotalWorkingYears_bin_missing       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TotalWorkingYears_bin1              <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0,...
## $ TotalWorkingYears_bin2              <dbl> 1, 1, 1, 1, 0, 1, 0, 0, 1,...
## $ TotalWorkingYears_bin3              <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,...
## $ TotalWorkingYears_bin4              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TrainingTimesLastYear_bin_missing   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ TrainingTimesLastYear_bin1          <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 1,...
## $ TrainingTimesLastYear_bin2          <dbl> 0, 1, 1, 1, 1, 0, 1, 0, 0,...
## $ TrainingTimesLastYear_bin3          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ WorkLifeBalance_Bad                 <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ WorkLifeBalance_Good                <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0,...
## $ WorkLifeBalance_Better              <dbl> 0, 1, 1, 1, 1, 0, 0, 1, 1,...
## $ WorkLifeBalance_Best                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsAtCompany_bin_missing          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsAtCompany_bin1                 <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 0,...
## $ YearsAtCompany_bin2                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsAtCompany_bin3                 <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 1,...
## $ YearsAtCompany_bin4                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsInCurrentRole_bin_missing      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsInCurrentRole_bin1             <dbl> 0, 0, 1, 0, 1, 0, 1, 1, 0,...
## $ YearsInCurrentRole_bin2             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsInCurrentRole_bin3             <dbl> 1, 1, 0, 1, 0, 1, 0, 0, 1,...
## $ YearsInCurrentRole_bin4             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsSinceLastPromotion_bin_missing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsSinceLastPromotion_bin1        <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 1,...
## $ YearsSinceLastPromotion_bin2        <dbl> 0, 0, 0, 1, 1, 1, 0, 0, 0,...
## $ YearsSinceLastPromotion_bin3        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsWithCurrManager_bin_missing    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsWithCurrManager_bin1           <dbl> 0, 0, 1, 1, 1, 0, 1, 1, 0,...
## $ YearsWithCurrManager_bin2           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ YearsWithCurrManager_bin3           <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0,...
## $ YearsWithCurrManager_bin4           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1,...
## $ Attrition_No                        <dbl> 0, 1, 0, 1, 1, 1, 1, 1, 1,...
## $ Attrition_Yes                       <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 0,...
  • Note all values are binary - 1s and 0s. Data has 141 variables.
  • tidy() is a context-specific function that converts non-data-frame objects to data frames (tibbles). From broom package.
    • recipes has a special implementation of tidy()

Below are the steps that were taken:

## # A tibble: 4 x 5
##   number operation type       trained skip 
##    <int> <chr>     <chr>      <lgl>   <lgl>
## 1      1 step      zv         TRUE    FALSE
## 2      2 step      num2factor TRUE    FALSE
## 3      3 step      discretize TRUE    FALSE
## 4      4 step      dummy      TRUE    FALSE

It is easy to return the details of the operations in the recipe hierarchy provided by tidy(recipe_obj). This is very helpful for understanding the binning strategy the recipe provided.
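The breaks below could be retrieved with a call like this (a sketch; recipe_obj is assumed to be the prepped recipe from earlier):

```r
library(recipes)

# Step 3 is the discretize step; tidy() with number = 3 returns its breaks
tidy(recipe_obj, number = 3)
```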

## # A tibble: 73 x 2
##    terms     value
##    <chr>     <dbl>
##  1 Age        -Inf
##  2 Age          30
##  3 Age          36
##  4 Age          43
##  5 Age         Inf
##  6 DailyRate  -Inf
##  7 DailyRate   464
##  8 DailyRate   798
##  9 DailyRate  1156
## 10 DailyRate   Inf
## # ... with 63 more rows
  • recall that forcats::as_factor assigns levels in the order the data appears in the dataset (unlike base as.factor, which sorts levels alphabetically)

13.2 Data Manipulation

The goal is to get the Attrition_Yes correlation information from train_corr_level. First, recall what the function get_cor does (review tidy eval if needed).

## function (data, target, use = "pairwise.complete.obs", fct_reorder = FALSE, 
##     fct_rev = FALSE) 
## {
##     feature_expr <- enquo(target)
##     feature_name <- quo_name(feature_expr)
##     data_cor <- data %>% mutate_if(is.character, as.factor) %>% 
##         mutate_if(is.factor, as.numeric) %>% cor(use = use) %>% 
##         as.tibble() %>% mutate(feature = names(.)) %>% select(feature, 
##         !(!feature_expr)) %>% filter(!(feature == feature_name)) %>% 
##         mutate_if(is.character, as_factor)
##     if (fct_reorder) {
##         data_cor <- data_cor %>% mutate(feature = fct_reorder(feature, 
##             !(!feature_expr))) %>% arrange(feature)
##     }
##     if (fct_rev) {
##         data_cor <- data_cor %>% mutate(feature = fct_rev(feature)) %>% 
##             arrange(feature)
##     }
##     return(data_cor)
## }
  • Attrition_No is not needed - only interested in employees that leave
  • Filter the records to only evaluate features that have a moderately strong correlation
  • After running get_cor, the table looks like this:
## Warning in stats::cor(x, ...): the standard deviation is zero
## # A tibble: 6 x 2
##   feature                           Attrition_Yes
##   <fct>                                     <dbl>
## 1 OverTime_No                             -0.240 
## 2 StockOptionLevel_X1                     -0.161 
## 3 JobLevel_X2                             -0.134 
## 4 Department_Research...Development       -0.105 
## 5 YearsInCurrentRole_bin4                 -0.102 
## 6 TotalWorkingYears_bin4                  -0.0994
  • create a new column by converting the factors in feature to type character so the values can be manipulated easily
  • tidyr::separate splits columns of character data into multiple columns using a separator
    • Want to keep only the text before the "_"
    • extra: If sep is a character vector, this controls what happens when there are too many pieces. There are three valid options:
      1. “warn” (the default): emit a warning and drop extra values.
      2. “drop”: drop any extra values without a warning. (In this case, drop the text after the “_”.)
      3. “merge”: only splits at most length(into) times
    • This returns the values we will group by
  • mutate(feature_base = as_factor(feature_base) %>% fct_rev())
    • as_factor assigns factor levels in the order they appear. Therefore, Overtime would be assigned 1, JobLevel would be assigned 2. . . .
    • fct_rev() changes this so Overtime level would become the highest level number, JobLevel assigned one less and so on. . .
    • This code is needed so the resulting plot a bit later displays correctly (like a funnel). The higher levels are plotted at the top of the plot with each successive level ordered in a descending manner.
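The manipulation described above could be sketched as follows (object names and the correlation cutoff are assumptions, not the course’s exact values):

```r
library(tidyverse)

corr_level_tbl <- train_corr_tbl %>%
    select(-Attrition_No) %>%                       # only interested in employees that leave
    get_cor(Attrition_Yes, fct_reorder = TRUE, fct_rev = TRUE) %>%
    filter(abs(Attrition_Yes) >= 0.02) %>%          # keep moderately strong correlations (cutoff assumed)
    mutate(relationship = as.character(feature)) %>% # factor -> character for manipulation
    separate(relationship, into = "feature_base",
             sep = "_", extra = "drop") %>%          # keep only the text before "_"
    mutate(feature_base = as_factor(feature_base) %>% fct_rev())  # ordering for the funnel plot
```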

13.4 Recommendation Strategy Development Worksheet

The worksheet takes the information from the correlation analysis to develop strategies to implement a recommendation engine. The intent is to provide managers targeted strategies based on high risk factors.

Note in the worksheet the strategy group options provide a way to group features that appear to have affinity with one another. These groups will be used in the Shiny App to be developed later.

Skip is selected under Potential Strategy because some features are not open for change/manipulation.

Strategy Group assignment is performed after the other columns are completed

This is where data scientists are important. They apply critical thinking to develop a coded algorithm combining data (data-driven) and knowledge of a system (business-driven). This recommendation algorithm becomes valuable to the business because, once deployed, it drives decision making by process owners (in this case, managers).

13.4.1 Personal Development (Mentorship, Education)

The information below is copied from the Excel Spreadsheet

  • YearsAtCompany
    • YAC - High - Likely to stay / YAC - Low - Likely to leave
    • Tie promotion if Low to advance faster / Mentor if YAC low
  • TotalWorkingYears
    • TWY - High - Likely to stay /TWY - Low - Likely to leave
    • Tie Low TWY to training & formation/mentorship
  • YearsInCurrentRole
    • More time in current role related to lower attrition
    • Incentivize specialization or promote
      • more professional than personal - it’s a mix, so modified below (Incentivize specialization or promote / Mentorship Role)
  • JobInvolvement
    • High JI - likely to stay / low JI likely to leave
    • Create personal development plan if low / High seek leadership role
  • JobSatisfaction
    • Low JS more likely to leave
    • Low - personal development plan / High - Mentorship roles
  • PerformanceRating - ADDED
    • If low, target for personal development plan / High possible mentor or leader
    • If you know features should be included based of your knowledge of the business, add them

13.4.2 Good-Better-Best Approach

Using the information from the spreadsheet, develop logic that can be turned into code.

Good-Better-Best is just one type of approach.

Alternatively, consider for Personal Development, YearsAtCompany has the greatest separation in the Correlation Analysis Plot. Wouldn’t its position on the plot suggest it is the feature that needs the most attention and therefore the worst case? (I get that there is little you can really do as a manager for YAC. Just an example to try to understand the Best/Worst Case logic.) Response: This is a reasonable assessment. In reality, you’d come up with your logic, which may be imperfect. You’d present this to your management & stakeholders. They would recommend changes as necessary to tweak it to their experience and the data, just as you are doing with me. This is the iterative process and communication. The goal is to get a story to tell, and they will suggest changes. This is good - because when you incorporate their changes, you gain buy-in!

  • Worst Case
    • Create Personal Development Plan
      • JobSatisfaction
      • JobInvolvement
      • ADDED PerformanceRating if low (because it is a logical thing to do)

  • Better Case
    • Promote training and formation
      • YearsAtCompany
      • TotalWorkingYears (YearsInCurrentRole is not used because it overlaps with the 2 selected above)
  • Best Case 1
    • Seek Mentorship Role
      • YearsInCurrentRole
      • YearsAtCompany
      • ADDED PerformanceRating if high, plus JobSatisfaction. (Some people make great mentors, others are good leaders.) Mentors tend to be knowledgeable, experienced and have good satisfaction metrics. Typically someone with 3-5 years has sufficient knowledge to become a mentor - this drives the comparison value.

  • Best Case 2
    • Seek Leadership Role
      • JobInvolvement
      • JobSatisfaction
      • ADDED PerformanceRating
    • Leaders tend to be very involved and have good performance metrics. Want to identify leaders early. The main difference between mentors and leaders is that leaders do not have a time component. Leaders have high JobInvolvement, which is a measure of engagement.

Using the information from the above strategy, code the logic into R code.

Before coding, understand how to discover bins were divided in the recipe. This is easy to do using tidy again. Here is an example using YearsAtCompany:

## # A tibble: 4 x 5
##   number operation type       trained skip 
##    <int> <chr>     <chr>      <lgl>   <lgl>
## 1      1 step      zv         TRUE    FALSE
## 2      2 step      num2factor TRUE    FALSE
## 3      3 step      discretize TRUE    FALSE
## 4      4 step      dummy      TRUE    FALSE

Note the discretize type above. This tells us the binning occurs in step 3.

## # A tibble: 73 x 2
##    terms     value
##    <chr>     <dbl>
##  1 Age        -Inf
##  2 Age          30
##  3 Age          36
##  4 Age          43
##  5 Age         Inf
##  6 DailyRate  -Inf
##  7 DailyRate   464
##  8 DailyRate   798
##  9 DailyRate  1156
## 10 DailyRate   Inf
## # ... with 63 more rows

This provides all the discretize steps. In this example, we are only interested in YearsAtCompany:
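The filter could be written with stringr (a sketch; recipe_obj is assumed to be the prepped recipe):

```r
library(recipes)
library(stringr)

# Keep only the breaks for YearsAtCompany from the discretize step (step 3)
tidy(recipe_obj, number = 3) %>%
    filter(str_detect(terms, "YearsAtCompany"))
```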

## # A tibble: 5 x 2
##   terms          value
##   <chr>          <dbl>
## 1 YearsAtCompany  -Inf
## 2 YearsAtCompany     3
## 3 YearsAtCompany     5
## 4 YearsAtCompany    10
## 5 YearsAtCompany   Inf

This defines the bins: Bin1 < 3, Bin2 = 3-4, Bin3 = 5-9, Bin4 > 9. Knowing the definition of the bins is needed below. In practice, perform this step for each binned variable.

It is also important to know factor levels:
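For example, for one of the ordinal satisfaction factors (the column name here is an assumption):

```r
# Inspect the factor levels of a readable ordinal feature
levels(train_readable_tbl$JobSatisfaction)
```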

## [1] "Low"       "Medium"    "High"      "Very High"

The next step is easy to do. Create an outline with sections for Worst Case, Better Case and Best Case, then collect the measures that belong to each section. Coding then becomes pretty easy - just leverage case_when with the logic that has been defined.
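A sketch of that case_when logic follows. The cutoffs below are illustrative stand-ins chosen from the bin breaks and factor levels discussed above, not the course’s exact values:

```r
library(tidyverse)

strategy_tbl <- train_readable_tbl %>%
    select(YearsAtCompany, TotalWorkingYears, YearsInCurrentRole,
           JobInvolvement, JobSatisfaction, PerformanceRating) %>%
    mutate_if(is.factor, as.numeric) %>%   # Low = 1 ... Very High = 4
    mutate(personal_development_strategy = case_when(
        # Worst case: create a personal development plan
        JobSatisfaction == 1 | JobInvolvement <= 2 | PerformanceRating <= 2 ~
            "Create Personal Development Plan",
        # Better case: early tenure -> training and formation
        YearsAtCompany < 3 | TotalWorkingYears < 8 ~
            "Promote Training and Formation",
        # Best case 1: experienced in role with good satisfaction -> mentorship
        YearsInCurrentRole >= 3 & JobSatisfaction >= 3 & PerformanceRating >= 3 ~
            "Seek Mentorship Role",
        # Best case 2: highly involved, no time component -> leadership
        JobInvolvement >= 3 & JobSatisfaction >= 3 & PerformanceRating >= 3 ~
            "Seek Leadership Role",
        # Everyone else
        TRUE ~ "Retain and Maintain"
    ))

# Always check the resulting spread from a business perspective
strategy_tbl %>% count(personal_development_strategy)
```

Because case_when evaluates conditions in order, the worst case captures employees first and each later branch only sees those not yet assigned - the progressive targeting described below.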

13.4.3 Recommendation: Personal Development Plan

To illustrate, the worst case is presented below, finished with extra code to inspect the results. This is important as you translate logic into code: while the logic may seem sound at first, if the resulting spread does not make intuitive sense, the assigned metrics may need to be tweaked so the results are balanced in such a way that action can be taken effectively.

## .
## Create Personal Development Plan              Retain and Maintain 
##                              553                              697

About 1/3 of the employees are being targeted to Create Personal Development Plan. That seems a bit much, so one option is to change the JobInvolvement cutoff from 2 to 1, which would reduce the number to something that appears more reasonable. However, the data shows that JobInvolvement is an important measure, so leave it at 2 for now.

The important thing to do is to always evaluate the outcome from a business perspective at each step.

  • Many assigned to Personal Development Plan - may be too much to manage. Use business domain knowledge to change case_when logic to reduce if appropriate (~308 if JobInvolvement == 1)
  • Best Case 2: Seek Leadership Role: When JI, JS and PR are all >= 4, only one record is returned - too aggressive - vary the cutoffs to achieve results that satisfy business domain knowledge
## .
## Create Personal Development Plan   Promote Training and Formation 
##                              553                              216 
##              Retain and Maintain             Seek Leadership Role 
##                              111                              195 
##             Seek Mentorship Role 
##                              175

Progressive Targeting: The logic-based system uses an if-style evaluation, progressively isolating groups from the most concerning to the least concerning.

Glimpse at the results:

## Observations: 1,250
## Variables: 7
## $ YearsAtCompany                <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5...
## $ TotalWorkingYears             <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17...
## $ YearsInCurrentRole            <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4,...
## $ JobInvolvement                <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4,...
## $ JobSatisfaction               <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2,...
## $ PerformanceRating             <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3,...
## $ personal_development_strategy <chr> "Seek Mentorship Role", "Create ...

The goal is not perfection. This will develop over iterations and decision-maker feedback before and during deployment.

13.4.4 Recommendation: Professional Development

Focused on promotion, rotation and specialization in the job. Rotation keeps people fresh and excited, and helps them understand multiple aspects of the business. Specialization focuses on becoming the best specialist, often for highly technical people who love their job (high satisfaction) but are not interested in leadership.

Also use the Correlation Analysis Plot for guidance.

Progressively make it more difficult to be promoted. This is a good thing because it incentivizes the performance attributes that matter to the business. These key characteristics should be communicated to employees during performance reviews.


JobLevel
- High performers in entry-level positions (Level 1) are leaving; Job Level 2 employees are staying
- Promote high-performing employees faster

YearsAtCompany (YAC)
- High YAC - likely to stay; low YAC - likely to leave
- If YAC is low, tie promotion to faster advancement and provide mentoring

YearsInCurrentRole (YiCR)
- More time in current role related to lower attrition
- Incentivize specialization or promotion - preferred as the main driver over YAC

Additional Features
- JobInvolvement - important for promotion readiness; incentivizes involvement for leaders and early promotion
- JobSatisfaction - important for specialization; incentivizes satisfaction for mentors
- PerformanceRating - important for any promotion

Good Better Best Approach - Typical Career Paths: 1 - Leadership (Management); 2 - Specialist (Technical). Both should have promotion schedules based on clear metrics.

Ready for Rotation: YearsInCurrentRole, JobSatisfaction (LOW)
Ready for Promotion Level 2: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Ready for Promotion Level 3: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Ready for Promotion Level 4: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Ready for Promotion Level 5: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
Incentivize Specialization: YearsInCurrentRole, JobSatisfaction, PerformanceRating

train_readable_tbl %>% 
     select(JobLevel, YearsInCurrentRole, JobInvolvement, JobSatisfaction, PerformanceRating) %>% 
     mutate_if(is.factor, as.numeric) %>% 
     mutate(
          professional_development_strategy = case_when(
               # Ready for Rotation: YearsInCurrentRole, JobSatisfaction (LOW)
               YearsInCurrentRole >= 2 & JobSatisfaction <= 2 ~ "Ready for Rotation",
               # Ready for Promotion Level 2: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
               JobLevel == 1 & YearsInCurrentRole >= 2 & JobInvolvement >= 3 & 
                    PerformanceRating >= 3 ~ "Ready for Promotion",
               # Ready for Promotion Level 3: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
               JobLevel == 2 & YearsInCurrentRole >= 2 & JobInvolvement >= 4 & 
                    PerformanceRating >= 3 ~ "Ready for Promotion",
               # Ready for Promotion Level 4: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
               JobLevel == 3 & YearsInCurrentRole >= 3 & JobInvolvement >= 4 & 
                    PerformanceRating >= 3 ~ "Ready for Promotion",
               # Ready for Promotion Level 5: JobLevel, YearsInCurrentRole, JobInvolvement, PerformanceRating
               # Only a few are added at this level, but that is probably OK - pretty high up in the company
               JobLevel == 4 & YearsInCurrentRole >= 4 & JobInvolvement >= 4 & 
                    PerformanceRating >= 3 ~ "Ready for Promotion",
               # Incentivize Specialization: YearsInCurrentRole, JobSatisfaction, PerformanceRating 
               YearsInCurrentRole >= 4 & JobSatisfaction >= 4 & 
                    PerformanceRating >= 3 ~ "Incentivizes Specialization",
               # Catch all
               TRUE ~ "Retain and Maintain")) %>% 
     pull(professional_development_strategy) %>% 
     table() 
## .
## Incentivizes Specialization         Ready for Promotion 
##                         138                         182 
##          Ready for Rotation         Retain and Maintain 
##                         384                         546

13.4.5 Recommendation: Work Environment Strategy

Good Better Best Approach

  • Improve Work-Life Balance: OverTime, WorkLifeBalance
  • Monitor Business Travel: BusinessTravel, DistanceFromHome, WorkLifeBalance
  • Review Job Assignment: EnvironmentSatisfaction, YearsInCurrentRole
  • Promote Job Engagement: JobInvolvement - just a rational addition
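The code behind the counts below is not shown; a sketch assuming `train_readable_tbl`, using the same numeric encodings and thresholds as the work-environment branch of `recommend_function`:

```r
# Work environment strategy: factors converted to numeric so, e.g.,
# OverTime == 2 corresponds to "Yes" and WorkLifeBalance == 1 to "Bad"
train_readable_tbl %>% 
     select(OverTime, BusinessTravel, WorkLifeBalance, DistanceFromHome, 
            EnvironmentSatisfaction, YearsInCurrentRole, JobInvolvement) %>% 
     mutate_if(is.factor, as.numeric) %>% 
     mutate(
          work_environment_strategy = case_when(
               OverTime == 2 | WorkLifeBalance == 1 ~ "Improve Work-Life Balance",
               (BusinessTravel == 3 | DistanceFromHome >= 10) & 
                    WorkLifeBalance == 2 ~ "Monitor Business Travel",
               EnvironmentSatisfaction == 1 & YearsInCurrentRole >= 2 ~ "Review Job Assignment",
               JobInvolvement <= 2 ~ "Promote Job Involvement",
               TRUE ~ "Retain and Maintain")) %>% 
     count(work_environment_strategy)
```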
## # A tibble: 5 x 2
##   work_environment_strategy     n
##   <chr>                     <int>
## 1 Improve Work-Life Balance   407
## 2 Monitor Business Travel     105
## 3 Promote Job Involvement     182
## 4 Retain and Maintain         423
## 5 Review Job Assignment       133

Review all the work performed so far:

## Observations: 1,250
## Variables: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No,...
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Tra...
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department               <fct> Sales, Research & Development, Resear...
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <fct> College, Below College, College, Mast...
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, ...
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction  <fct> Medium, High, Very High, Very High, L...
## $ Gender                   <fct> Female, Male, Male, Female, Male, Mal...
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <fct> High, Medium, Medium, High, High, Hig...
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole                  <fct> Sales Executive, Research Scientist, ...
## $ JobSatisfaction          <fct> Very High, Medium, High, High, Medium...
## $ MaritalStatus            <fct> Single, Married, Single, Married, Mar...
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18                   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y...
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, N...
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <fct> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <fct> Low, Very High, Medium, High, Very Hi...
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <fct> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...

14 Recommendation Function

data <-  train_readable_tbl
employee_number <- 19

recommend_function <- function(data, employee_number) {
     data %>% 
          filter(EmployeeNumber == employee_number) %>% 
          mutate_if(is.factor, as.numeric) %>%
          # Personal Development Strategy
          mutate(
               personal_development_strategy = case_when(
                    PerformanceRating == 1 | JobSatisfaction == 1 | 
                         JobInvolvement <= 2 ~ "Create Personal Development Plan",
                    YearsAtCompany < 3 | TotalWorkingYears < 6 ~ "Promote Training and Formation",
                    (YearsInCurrentRole > 3 | YearsAtCompany >= 5) & PerformanceRating >= 3 & 
                         JobSatisfaction == 4 ~ "Seek Mentorship Role",
                    JobInvolvement >= 3 & JobSatisfaction >= 3 & 
                         PerformanceRating >= 3 ~ "Seek Leadership Role", 
                    TRUE ~ "Retain and Maintain")) %>% 
          # Professional Development Strategy
          mutate(
               professional_development_strategy = case_when(
                    YearsInCurrentRole >= 2 & JobSatisfaction <= 2 ~ "Ready for Rotation",
                    JobLevel == 1 & YearsInCurrentRole >= 2 & JobInvolvement >= 3 & 
                         PerformanceRating >= 3 ~ "Ready for Promotion",
                    JobLevel == 2 & YearsInCurrentRole >= 2 & JobInvolvement >= 4 & 
                         PerformanceRating >= 3 ~ "Ready for Promotion",
                    JobLevel == 3 & YearsInCurrentRole >= 3 & JobInvolvement >= 4 & 
                         PerformanceRating >= 3 ~ "Ready for Promotion",
                    JobLevel == 4 & YearsInCurrentRole >= 4 & JobInvolvement >= 4 & 
                         PerformanceRating >= 3 ~ "Ready for Promotion",
                    YearsInCurrentRole >= 4 & JobSatisfaction >= 4 & 
                         PerformanceRating >= 3 ~ "Incentivizes Specialization",
                    TRUE ~ "Retain and Maintain")) %>% 
          # Work Environment Strategy
          mutate(
               work_environment_strategy = case_when(
                    OverTime == 2 | WorkLifeBalance == 1 ~ "Improve Work-Life Balance", 
                    (BusinessTravel == 3 | DistanceFromHome >= 10) & 
                         WorkLifeBalance == 2 ~ "Monitor Business Travel",
                    # WorkLifeBalance == 2 differentiates this case from the one above
                    EnvironmentSatisfaction == 1 & YearsInCurrentRole >= 2 ~ "Review Job Assignment",
                    JobInvolvement <= 2 ~ "Promote Job Involvement", 
                    TRUE ~ "Retain and Maintain")) %>% 
          select(EmployeeNumber, personal_development_strategy, 
                 professional_development_strategy, work_environment_strategy)
}

Test the recommendation algorithm.
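The calls that produced the output below are not shown; a sketch, assuming `train_readable_tbl` and the `recommend_function` defined above:

```r
# Which EmployeeNumbers exist in the training split?
train_readable_tbl %>% select(EmployeeNumber)

# Recommendations for individual employees
recommend_function(train_readable_tbl, employee_number = 1)
recommend_function(train_readable_tbl, employee_number = 2)
recommend_function(train_readable_tbl, employee_number = 4)

# An EmployeeNumber missing from the training split returns zero rows
recommend_function(train_readable_tbl, employee_number = 19)

# A subset of employees of particular interest - possibly those with
# Attrition == "Yes" in the training split - then one recommendation
train_readable_tbl %>% 
     filter(Attrition == "Yes") %>% 
     select(EmployeeNumber)
recommend_function(train_readable_tbl, employee_number = 1767)
```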

## # A tibble: 1,250 x 1
##    EmployeeNumber
##             <dbl>
##  1              1
##  2              2
##  3              4
##  4              5
##  5              7
##  6              8
##  7             10
##  8             11
##  9             12
## 10             13
## # ... with 1,240 more rows
## # A tibble: 1 x 4
##   EmployeeNumber personal_developme~ professional_devel~ work_environment~
##            <dbl> <chr>               <chr>               <chr>            
## 1              1 Seek Mentorship Ro~ Incentivizes Speci~ Improve Work-Lif~
## # A tibble: 1 x 4
##   EmployeeNumber personal_developme~ professional_devel~ work_environment~
##            <dbl> <chr>               <chr>               <chr>            
## 1              2 Create Personal De~ Ready for Rotation  Promote Job Invo~
## # A tibble: 1 x 4
##   EmployeeNumber personal_developme~ professional_devel~ work_environment~
##            <dbl> <chr>               <chr>               <chr>            
## 1              4 Create Personal De~ Retain and Maintain Improve Work-Lif~
## # A tibble: 0 x 4
## # ... with 4 variables: EmployeeNumber <dbl>,
## #   personal_development_strategy <chr>,
## #   professional_development_strategy <chr>,
## #   work_environment_strategy <chr>
## # A tibble: 220 x 1
##    EmployeeNumber
##             <dbl>
##  1            228
##  2           1278
##  3           1250
##  4           2065
##  5           1767
##  6           1308
##  7             18
##  8            460
##  9           1369
## 10           1040
## # ... with 210 more rows
## # A tibble: 1 x 4
##   EmployeeNumber personal_developme~ professional_devel~ work_environment~
##            <dbl> <chr>               <chr>               <chr>            
## 1           1767 Create Personal De~ Retain and Maintain Improve Work-Lif~